Section 01
NPUMoE: A Breakthrough in Efficiently Running Large MoE Models on the Apple Silicon NPU
The research team proposes NPUMoE, an inference engine that offloads Mixture-of-Experts (MoE) large-model inference to the Apple Neural Engine (ANE) using techniques such as static layering, grouped execution, and load-aware scheduling. The engine delivers significant gains in both performance and energy efficiency, offering a practical path for running large models efficiently on edge devices.
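For context on what the engine is routing, the sketch below shows generic top-k MoE gating: a small gating network scores all experts for a token, and only the top-k experts actually execute. This is an illustrative NumPy sketch of the standard MoE mechanism, not NPUMoE's actual implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def top_k_gate(gate_logits: np.ndarray, k: int):
    """Return the indices of the k highest-scoring experts and their softmax weights."""
    idx = np.argsort(gate_logits)[-k:]          # top-k expert indices
    w = np.exp(gate_logits[idx] - gate_logits[idx].max())
    return idx, w / w.sum()                     # weights renormalized over the top-k

def moe_forward(x: np.ndarray, experts, gate_w: np.ndarray, k: int = 2) -> np.ndarray:
    """Route token x through its top-k experts and mix their outputs."""
    idx, weights = top_k_gate(x @ gate_w, k)    # gating network is one linear layer
    return sum(wt * experts[i](x) for i, wt in zip(idx, weights))

# Tiny demo: 4 experts, each a random linear layer on an 8-dim token.
rng = np.random.default_rng(0)
d, n_experts = 8, 4
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in expert_mats]
gate_w = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=d), experts, gate_w, k=2)
```

Because only k of the E experts run per token, total parameter count can grow without a proportional increase in per-token compute, which is what makes offloading the active experts to a constrained accelerator like the ANE attractive.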