Zing Forum

ReMoE: Boosting Expert Reuse Rate of MoE Models via Router Fine-Tuning to Address Inference Bottlenecks in Memory-Constrained Scenarios

The BUAA OSCAR team proposes ReMoE, a framework that fine-tunes the router's expert selection strategy to raise the expert reuse rate by 26% while maintaining model performance. On edge devices it achieves a decoding speedup of up to roughly 2x (1.77-1.99x), offering a practical way to deploy MoE models in resource-constrained environments.

MoE, Mixture-of-Experts models, model inference optimization, edge computing, cache optimization, vLLM, llama.cpp, large model deployment
Published 2026-05-26 22:32 | Recent activity 2026-05-27 13:19 | Estimated read 6 min

Section 01

Core Interpretation of the ReMoE Framework: Boosting MoE Expert Reuse Rate to Break Through Memory-Constrained Inference Bottlenecks

Key Information:

  • Team: BUAA-OSCAR (Operating System and Compilation Optimization Research Group, Beihang University)
  • Achievements: Expert reuse rate +26%, decoding speedup of 1.77-1.99x on edge devices
  • Value: Addresses the inference bottleneck of MoE models in memory-constrained settings and serves as an example of training-inference co-optimization
  • Open-source code: https://github.com/BUAA-OSCAR/ReMoE
  • Original paper link: http://arxiv.org/abs/2605.27081v1

Section 02

Background: Memory Bottleneck Issues in MoE Model Inference

Mixture-of-Experts (MoE) models reduce computational costs via sparse activation mechanisms, but face memory challenges during inference:

  1. Capacity Conflict: All parameters must reside in memory to serve arbitrary inputs, but the GPU memory of an edge device cannot hold every expert
  2. Cache Bottleneck: Current strategies keep active experts in high-speed memory; any expert that is not cached must be loaded from external storage, which causes frequent I/O delays
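
To make the cache bottleneck concrete, the following minimal sketch simulates an expert cache with least-recently-used (LRU) eviction and counts how many experts must be fetched from slower storage. The LRU policy, the function name `count_expert_loads`, the cache capacity, and both routing traces are illustrative assumptions, not details taken from the ReMoE paper.

```python
from collections import OrderedDict


def count_expert_loads(routing_trace, cache_capacity):
    """Count cache misses (expert loads) for a hypothetical LRU expert cache.

    routing_trace: list of per-token lists of expert ids (toy data).
    cache_capacity: number of experts that fit in fast memory.
    """
    cache = OrderedDict()  # keys are expert ids, ordered least to most recently used
    loads = 0
    for experts in routing_trace:
        for expert in experts:
            if expert in cache:
                cache.move_to_end(expert)  # hit: mark as most recently used
            else:
                loads += 1  # miss: expert must be fetched from slower storage
                cache[expert] = None
                if len(cache) > cache_capacity:
                    cache.popitem(last=False)  # evict the least recently used expert
    return loads


# Two toy traces with top-2 routing over six experts (assumed values).
scattered = [[0, 1], [2, 3], [4, 5], [0, 1], [2, 3], [4, 5]]
stable = [[0, 1], [0, 1], [0, 2], [0, 2], [2, 3], [2, 3]]

scattered_loads = count_expert_loads(scattered, cache_capacity=4)
stable_loads = count_expert_loads(stable, cache_capacity=4)
print(scattered_loads, stable_loads)
```

With these toy traces and a four-expert cache, the scattered pattern misses on every selection (12 loads), while the temporally stable pattern needs only 4.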

Take DeepSeek-V3 as an example: it has 671B total parameters, of which only 37B are activated per token, yet all parameters need to stay in memory. When they do not fit in fast memory, cache eviction and expert loading become the efficiency bottleneck.
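
A back-of-envelope calculation shows the scale of the gap. The parameter counts come from the text above; the bytes-per-parameter values are illustrative assumptions about weight precision, and the estimate covers weights only (no KV cache or activations).

```python
TOTAL_PARAMS = 671 * 10**9   # total parameters, from the text
ACTIVE_PARAMS = 37 * 10**9   # parameters activated per token, from the text


def weight_footprint_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Assumed precisions: 2 bytes (16-bit) and 1 byte (8-bit) per parameter.
for label, bytes_per_param in [("16-bit", 2), ("8-bit", 1)]:
    total_gb = weight_footprint_gb(TOTAL_PARAMS, bytes_per_param)
    active_gb = weight_footprint_gb(ACTIVE_PARAMS, bytes_per_param)
    print(f"{label}: all weights {total_gb:.0f} GB, active per token {active_gb:.0f} GB")

print(f"active fraction: {active_fraction:.1%}")
```

Only about 5.5% of the weights are touched per token, which is why caching a subset of experts is attractive and why cache misses are costly.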

Section 03

Core Methods of ReMoE: Router Fine-Tuning and Three-Stage Training Process

Core Idea

ReMoE exploits the temporal locality of expert selection: it fine-tunes the router to prefer recently used experts, so that consecutive tokens tend to reuse experts that are already activated. The resulting allocation pattern is stable over time and matches cache locality, and it adds no inference-time overhead.
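
The source does not spell out how the reuse rate is measured. One plausible definition, assumed here purely for illustration, is the fraction of a token's selected experts that were also selected for the previous token. The function name `expert_reuse_rate` and both traces are hypothetical.

```python
def expert_reuse_rate(routing_trace):
    """Fraction of expert selections that repeat an expert used by the previous token.

    routing_trace: list of per-token lists of expert ids (toy data).
    This definition is an assumption, not necessarily the paper's metric.
    """
    reused = 0
    total = 0
    for prev, curr in zip(routing_trace, routing_trace[1:]):
        prev_set = set(prev)
        reused += sum(1 for expert in curr if expert in prev_set)
        total += len(curr)
    return reused / total if total else 0.0


# Toy top-2 traces: one with little temporal locality, one with a lot.
before = [[0, 1], [2, 3], [0, 3], [4, 5]]
after = [[0, 1], [0, 1], [0, 3], [0, 3]]

rate_before = expert_reuse_rate(before)
rate_after = expert_reuse_rate(after)
print(f"{rate_before:.2f} -> {rate_after:.2f}")
```

Under this definition, a router with stronger temporal locality scores higher, and every reused expert is one that a cache is likely to already hold.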

Three-Stage Training

  1. Reuse-Aware Fine-Tuning: Introduce an auxiliary loss to reward the selection of recently used experts, balancing performance and reuse
  2. Load Balance Preservation: Retain the original load balance loss to avoid expert idleness
  3. Downstream Calibration: Lightweight downstream task calibration to ensure no performance degradation
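
The first two items can be read as terms of a single training objective. The sketch below is one way such an objective might look, not the paper's actual formulation: the reuse term (negative log of the router probability placed on the previous token's experts), the Switch-Transformer-style balance term, the weights `lam_reuse` and `lam_balance`, and all function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def reuse_loss(router_probs, prev_experts):
    """Assumed reuse term: small when the router favors the previous token's experts."""
    mass = sum(router_probs[e] for e in prev_experts)
    return -math.log(max(mass, 1e-12))  # guard against log(0)


def load_balance_loss(batch_probs, batch_top1):
    """Switch-Transformer-style balance term: N * sum_i f_i * P_i."""
    n_experts = len(batch_probs[0])
    n_tokens = len(batch_probs)
    # f_i: fraction of tokens whose top-1 expert is i
    f = [sum(1 for t in batch_top1 if t == i) / n_tokens for i in range(n_experts)]
    # P_i: mean router probability assigned to expert i
    p = [sum(probs[i] for probs in batch_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


def total_loss(lm_loss, batch_probs, batch_top1, prev_experts_per_token,
               lam_reuse=0.1, lam_balance=0.01):
    """Language-model loss plus weighted reuse and balance terms (weights are assumed)."""
    reuse = sum(
        reuse_loss(probs, prev)
        for probs, prev in zip(batch_probs, prev_experts_per_token)
    ) / len(batch_probs)
    balance = load_balance_loss(batch_probs, batch_top1)
    return lm_loss + lam_reuse * reuse + lam_balance * balance


# Toy batch: four tokens, four experts, uniform router probabilities.
uniform = [[0.25] * 4 for _ in range(4)]
example = total_loss(2.0, uniform, [0, 1, 2, 3], [[0, 1]] * 4)
print(example)
```

The two auxiliary terms pull in opposite directions: the reuse term concentrates routing on recent experts, while the balance term spreads tokens across experts, so the weights control the trade-off the list above describes.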

Section 04

Experimental Results: Significant Improvements in Expert Reuse Rate and Inference Efficiency

Key Data

  • Expert Reuse Rate: Increased by 26% (about 26 fewer loads from external storage per 100 tokens)
  • vLLM GPU-CPU Offloading: Throughput increased by 8.4%, end-to-end latency reduced
  • Edge Device Validation (Jetson Orin NX + llama.cpp):
    • Time per output token (TPOT) reduced by 43.6%-49.8%
    • Decoding speedup of 1.77-1.99x
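
The two edge-device figures are consistent with each other: if the time per token falls by a fraction r, decoding speed rises by a factor of 1 / (1 - r). A quick check, with `speedup_from_tpot_reduction` as a hypothetical helper name:

```python
def speedup_from_tpot_reduction(reduction):
    """Convert a fractional TPOT reduction into a decoding speedup factor."""
    return 1.0 / (1.0 - reduction)


low = speedup_from_tpot_reduction(0.436)
high = speedup_from_tpot_reduction(0.498)
print(f"{low:.2f}x to {high:.2f}x")  # prints "1.77x to 1.99x"
```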

The tested models span the DeepSeek and Qwen series; the fine-tuned models perform on par with or slightly better than the originals.

Section 05

Practical Significance and Insights: Training-Inference Co-Optimization Facilitates MoE Edge Deployment

Practical Value

  • Engineering Pain Point Resolution: Zero runtime overhead, seamless integration into existing training pipelines, no architecture modifications required
  • Deployment-Friendly: Optimized models are compatible with standard inference frameworks and cache strategies

Key Insights

Training-Inference Co-Optimization: Introducing deployment constraints (e.g., cache locality) during training can yield significant benefits without increasing inference complexity. This offers a template for related scenarios such as hardware feature adaptation and latency-constrained optimization.

Summary

ReMoE lowers the barriers to deploying MoE models in resource-constrained environments, which could help large models reach edge, embedded, and other diverse computing environments.