Section 01
Introduction: SurgeLLM Unlocks CPU/NPU Hybrid Inference for Ultra-Large Sparse MoE Models
SurgeLLM is a CPU/NPU hybrid inference runtime written in C++. It is designed to work around NPU memory limits so that ultra-large sparse MoE language models (e.g., Qwen3.6-35B-A3B) can run efficiently on edge NPU devices such as the Ascend 310P3. It coordinates host memory, CPU computation, and NPU acceleration, and follows an explicit-control design philosophy: developers can tune its behavior in detail for specific hardware and model characteristics, which addresses the hardware bottlenecks of MoE model deployment.
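To make the idea of coordinating host memory with NPU memory concrete, here is a minimal C++ sketch of one possible placement policy. It is an illustration only, not SurgeLLM's actual API: the names `Device`, `ExpertInfo`, and `place_experts`, the profiled activation rate, and the greedy strategy itself are all assumptions introduced for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical types for illustration; not taken from SurgeLLM.
enum class Device { NPU, CPU };

struct ExpertInfo {
    std::uint64_t bytes;     // size of this expert's weights
    double activation_rate;  // assumed: fraction of tokens routed to it (from profiling)
};

// Greedy placement under an explicit NPU memory budget:
// the most frequently activated experts are pinned to the NPU
// until the budget is exhausted; the rest stay in host memory
// and are computed on the CPU.
std::vector<Device> place_experts(const std::vector<ExpertInfo>& experts,
                                  std::uint64_t npu_budget_bytes) {
    // Visit experts from hottest to coldest, keeping ties in index order.
    std::vector<std::size_t> order(experts.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return experts[a].activation_rate >
                                experts[b].activation_rate;
                     });

    std::vector<Device> placement(experts.size(), Device::CPU);
    std::uint64_t used = 0;  // invariant: used <= npu_budget_bytes
    for (std::size_t i : order) {
        if (experts[i].bytes <= npu_budget_bytes - used) {
            placement[i] = Device::NPU;
            used += experts[i].bytes;
        }
    }
    return placement;
}
```

With three equally sized experts and a budget that fits two of them, the two with the highest activation rate land on the NPU and the third stays on the CPU. Exposing the budget as a plain parameter reflects the explicit-control idea: the developer, not the runtime, decides how much NPU memory experts may occupy.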