Section 01
Introduction: ReaLB—A New Real-Time Load Balancing Scheme for Multimodal MoE Inference
ReaLB addresses the load imbalance problem in multimodal MoE inference. Its core idea is to dynamically adjust the numerical precision at which experts compute (e.g., using low-precision FP4 for vision-intensive workloads). Without adding scheduling overhead or memory, it achieves a 1.29x speedup while keeping accuracy loss within 1.2%, making it an efficient option for deploying large multimodal models in production.
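The idea of trading precision for balance can be illustrated with a toy sketch. This is not ReaLB's actual policy: the function names, the `overload_ratio` threshold, and the assumed 2x throughput of FP4 are all hypothetical values chosen for illustration. The sketch marks experts whose token load is well above average for low-precision execution, then estimates step time as the cost of the slowest expert.

```python
from statistics import mean


def choose_precisions(token_counts, overload_ratio=1.5):
    """Label each expert 'fp4' if its load exceeds overload_ratio x the mean, else 'bf16'.

    overload_ratio is a hypothetical threshold, not a value from ReaLB.
    """
    avg = mean(token_counts)
    return ["fp4" if n > overload_ratio * avg else "bf16" for n in token_counts]


def estimated_step_time(token_counts, precisions, fp4_speedup=2.0):
    """Estimate step time as the cost of the slowest expert.

    Cost is measured in token units; fp4_speedup is an assumed factor,
    not a measured figure.
    """
    costs = [
        n / fp4_speedup if p == "fp4" else float(n)
        for n, p in zip(token_counts, precisions)
    ]
    return max(costs)


# One hot expert (e.g. receiving most vision tokens) and three lightly loaded ones.
loads = [900, 100, 120, 80]
precisions = choose_precisions(loads)
baseline = estimated_step_time(loads, ["bf16"] * len(loads))
balanced = estimated_step_time(loads, precisions)
print(precisions, baseline, balanced)
```

In this toy model only the overloaded expert switches precision, so no tokens are rerouted and no expert weights are replicated, which is the intuition behind avoiding extra scheduling and memory cost.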