Section 01
Introduction to ST-MoE: Accelerating Large MoE Model Inference via Spatiotemporal Expert Prefetching
Mixture of Experts (MoE) is a mainstream approach to scaling large language models, but because experts are activated dynamically, loading them introduces severe latency. The ST-MoE framework mines the spatiotemporal correlation of expert activation and combines a lightweight prediction mechanism with a reconfigurable hardware design so that expert loading overlaps with computation. This significantly improves the inference performance and energy efficiency of MoE models while maintaining model accuracy.
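To make the idea of overlapping expert loading with computation concrete, the following is a minimal toy timing model, not ST-MoE's actual predictor or hardware design. The function name `total_latency_ms`, the per-layer `hits` list, and the millisecond figures are all hypothetical values chosen for illustration. It assumes that a correctly predicted expert is fetched while the previous layer computes, so only the part of the load that exceeds one compute step remains exposed.

```python
def total_latency_ms(hits, load_ms, compute_ms):
    """Total latency over a sequence of MoE layers under a toy model.

    hits[i] is True when the expert for layer i was predicted correctly
    and prefetched during the previous layer's computation (assumption).
    """
    total = 0.0
    for hit in hits:
        if hit:
            # Load ran concurrently with the previous compute step;
            # only the non-overlapped remainder stalls the pipeline.
            stall = max(0.0, load_ms - compute_ms)
        else:
            # Misprediction or no prefetch: the full load is exposed.
            stall = load_ms
        total += stall + compute_ms
    return total


# Hypothetical numbers: 3 ms to load an expert, 2 ms to compute a layer.
on_demand = total_latency_ms([False] * 4, load_ms=3.0, compute_ms=2.0)
prefetched = total_latency_ms([True, True, True, False], load_ms=3.0, compute_ms=2.0)
print(on_demand, prefetched)
```

In this sketch the benefit depends on two things: how often the prediction is right, and how much of the load time fits under the compute time. When `load_ms` is no larger than `compute_ms`, a correct prefetch hides the load entirely.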