# ReMoE: Boosting Expert Reuse Rate of MoE Models via Router Fine-Tuning to Address Inference Bottlenecks in Memory-Constrained Scenarios

> The BUAA OSCAR team proposes ReMoE, a framework that fine-tunes the router's expert selection strategy to raise the expert reuse rate by 26% while maintaining model performance. It achieves a decoding speedup of up to 1.99x on edge devices, providing a practical solution for deploying MoE models in resource-constrained environments.

- Published: 2026-05-26T14:32:56.000Z
- Keywords: MoE, Mixture-of-Experts models, model inference optimization, edge computing, cache optimization, vLLM, llama.cpp, large model deployment

---

## Core Interpretation of the ReMoE Framework: Boosting MoE Expert Reuse Rate to Break Through Memory-Constrained Inference Bottlenecks

ReMoE fine-tunes only the expert selection strategy of the MoE model's router, so that consecutive tokens tend to reuse the same experts. The reported result is a 26% higher expert reuse rate with no loss in model performance and a decoding speedup of 1.77-1.99x on edge devices, which makes MoE models more practical to deploy in resource-constrained environments.

**Key Information**: 
- Team: BUAA-OSCAR (Operating System and Compilation Optimization Research Group, Beihang University)
- Achievements: Expert reuse rate +26%, decoding speedup of 1.77-1.99x on edge devices
- Value: Addresses the inference bottleneck of MoE models under memory constraints and serves as an example of training-inference co-optimization
- Open-source code: https://github.com/BUAA-OSCAR/ReMoE
- Original paper link: http://arxiv.org/abs/2605.27081v1

## Background: Memory Bottleneck Issues in MoE Model Inference

Mixture-of-Experts (MoE) models reduce computational costs via sparse activation mechanisms, but face memory challenges during inference:
1. **Capacity Conflict**: All parameters must remain available to serve arbitrary inputs, but the GPU memory of an edge device cannot hold every expert
2. **Cache Bottleneck**: Current strategies keep active experts in high-speed memory; an uncached expert must be loaded from external storage, which causes frequent I/O delays

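Take DeepSeek-V3 as an example: it has 671B total parameters and activates only 37B per token, yet any expert may be selected next, so all parameters must stay accessible. When fast memory holds only a subset of the experts, cache eviction and reloading become the limiting factor for efficiency.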
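
To make the cache bottleneck concrete, here is a minimal sketch that counts how many experts must be loaded from slow storage under a least-recently-used (LRU) expert cache. The LRU policy, the cache size, and the two routing traces are illustrative assumptions, not details taken from the ReMoE paper.

```python
from collections import OrderedDict


def count_expert_loads(selections, cache_size):
    """Count cache misses (expert loads) for per-token expert selections.

    selections: list of lists of expert ids, one inner list per token.
    cache_size: number of experts that fit in fast memory (hypothetical value).
    """
    cache = OrderedDict()  # keys are expert ids, ordered from least to most recently used
    loads = 0
    for experts in selections:
        for expert in experts:
            if expert in cache:
                cache.move_to_end(expert)  # hit: mark as most recently used
            else:
                loads += 1  # miss: the expert must be fetched from slower storage
                cache[expert] = None
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used expert
    return loads


# Two made-up top-2 routing traces over six tokens.
scattered = [[0, 1], [2, 3], [4, 5], [0, 1], [2, 3], [4, 5]]
stable = [[0, 1], [0, 1], [0, 2], [0, 2], [2, 3], [2, 3]]

print(count_expert_loads(scattered, cache_size=4))  # 12: every selection misses
print(count_expert_loads(stable, cache_size=4))     # 4: only first-time experts miss
```

Both traces select the same number of experts, but the temporally stable one triggers a third of the loads, which is the kind of effect a reuse-oriented router aims for.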

## Core Methods of ReMoE: Router Fine-Tuning and Three-Stage Training Process

### Core Idea
ReMoE exploits the temporal locality of expert selection. It fine-tunes the router to prefer recently used experts, which encourages reuse of recently activated experts and produces a temporally stable allocation pattern that matches cache locality. Because the preference is learned during fine-tuning, it comes **with no inference-time overhead**.
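
One way to quantify this temporal locality is sketched below. The definition used here (the share of expert selections that repeat an expert chosen for the previous token) is an assumption for illustration; the paper may define its reuse rate differently, and the routing trace is made up.

```python
def expert_reuse_rate(selections):
    """Fraction of selections that repeat an expert chosen for the previous token.

    selections: list of lists of expert ids, one inner list per token.
    """
    reused = 0
    total = 0
    for prev, curr in zip(selections, selections[1:]):
        prev_set = set(prev)
        reused += sum(1 for expert in curr if expert in prev_set)
        total += len(curr)
    return reused / total if total else 0.0


# Hypothetical top-2 routing trace over six tokens.
trace = [[0, 1], [0, 1], [0, 2], [0, 2], [2, 3], [2, 3]]
print(expert_reuse_rate(trace))  # 0.8: 8 of the 10 selections after the first token are repeats
```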

### Three-Stage Training
1. **Reuse-Aware Fine-Tuning**: Introduce an auxiliary loss to reward the selection of recently used experts, balancing performance and reuse
2. **Load Balance Preservation**: Retain the original load balance loss to avoid expert idleness
3. **Downstream Calibration**: Lightweight downstream task calibration to ensure no performance degradation
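
The first two stages can be sketched as a combined training objective. The post does not give the exact loss, so everything here is an assumption: the reuse term is written as the negative log of the router probability mass on recently used experts, the balance term follows the commonly used `n_experts * sum(f_i * P_i)` form, and `alpha` and `beta` are hypothetical weights. Downstream calibration (stage 3) is not covered.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reuse_loss(router_logits, recent_experts):
    """Assumed reuse term: small when the router favors recently used experts."""
    probs = softmax(router_logits)
    mass = sum(probs[e] for e in recent_experts)
    return -math.log(mass + 1e-9)


def load_balance_loss(batch_probs, batch_topk):
    """Balance term n_experts * sum_i f_i * P_i over a batch of tokens.

    f_i is the share of selections routed to expert i, P_i its mean router probability.
    """
    n_experts = len(batch_probs[0])
    n_tokens = len(batch_probs)
    k = len(batch_topk[0])
    f = [0.0] * n_experts
    for chosen in batch_topk:
        for e in chosen:
            f[e] += 1.0 / (n_tokens * k)
    p = [sum(row[i] for row in batch_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


def total_loss(lm_loss, batch_logits, batch_recent, batch_topk, alpha=0.01, beta=0.01):
    """Language-model loss plus weighted reuse and balance terms (weights are hypothetical)."""
    probs = [softmax(logits) for logits in batch_logits]
    reuse = sum(reuse_loss(l, r) for l, r in zip(batch_logits, batch_recent)) / len(batch_logits)
    return lm_loss + alpha * reuse + beta * load_balance_loss(probs, batch_topk)
```

Keeping the balance term alongside the reuse term reflects the trade-off in the list above: the reuse term pulls routing toward a few recent experts, while the balance term keeps the remaining experts from going idle.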

## Experimental Results: Significant Improvements in Expert Reuse Rate and Inference Efficiency

### Key Data
- **Expert Reuse Rate**: Increased by 26% (about 26 fewer expert loads from external storage per 100 tokens)
- **vLLM GPU-CPU Offloading**: Throughput increased by 8.4%, end-to-end latency reduced
- **Edge Device Validation (Jetson Orin NX + llama.cpp)**: 
  - Time per output token (TPOT) reduced by 43.6%-49.8%
  - Decoding speedup of 1.77-1.99x
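
The two edge-device figures are consistent with each other: if each token takes a fraction `r` less time, decoding is `1 / (1 - r)` times faster. A short check, where the function name is only illustrative:

```python
def speedup_from_tpot_reduction(reduction):
    """Convert a fractional TPOT reduction (e.g. 0.436) into a speedup factor."""
    return 1.0 / (1.0 - reduction)


for r in (0.436, 0.498):
    print(f"{r:.1%} lower TPOT -> {speedup_from_tpot_reduction(r):.2f}x speedup")
# 43.6% lower TPOT -> 1.77x speedup
# 49.8% lower TPOT -> 1.99x speedup
```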

The tested models cover the DeepSeek and Qwen series. Their task performance after fine-tuning is reported as equal to or slightly better than that of the original models.

## Practical Significance and Insights: Training-Inference Co-Optimization Facilitates MoE Edge Deployment

### Practical Value
- **Engineering Pain Point Resolution**: Zero runtime overhead, seamless integration into existing training pipelines, no architecture modifications required
- **Deployment-Friendly**: Optimized models are compatible with standard inference frameworks and cache strategies

### Key Insights
**Training-Inference Co-Optimization**: Introducing deployment constraints (e.g., cache locality) during training can yield significant benefits without increasing inference complexity. The same approach may extend to scenarios such as adapting to hardware features or optimizing under latency constraints.

### Summary
ReMoE lowers the barrier to deploying MoE models in resource-constrained environments, which supports the adoption of large models on edge, embedded, and other diverse computing platforms.
