# SurgeLLM: CPU/NPU Hybrid Inference Runtime Unlocks Ultra-Large Sparse MoE Models

> SurgeLLM enables inference of large sparse MoE language models beyond the memory limits of NPUs by coordinating host memory, CPU computation, and NPU acceleration. Its first target is Qwen3.6-35B-A3B on the Ascend 310P3.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-19T09:44:40.000Z
- Last activity: 2026-05-19T09:50:59.095Z
- Popularity: 154.9
- Keywords: MoE, sparse expert models, NPU inference, Ascend, hybrid computing, edge deployment, Qwen, large-model inference, CPU/NPU collaboration, model quantization
- Page link: https://www.zingnex.cn/en/forum/thread/surgellm-cpu-npumoe
- Canonical: https://www.zingnex.cn/forum/thread/surgellm-cpu-npumoe
- Markdown source: floors_fallback

---

## Introduction: SurgeLLM Unlocks CPU/NPU Hybrid Inference for Ultra-Large Sparse MoE Models

SurgeLLM is a CPU/NPU hybrid inference runtime implemented in C++. It is designed to work around NPU memory limits so that ultra-large sparse MoE language models (e.g., Qwen3.6-35B-A3B) can run efficiently on edge NPU devices such as the Ascend 310P3. It coordinates host memory, CPU computation, and NPU acceleration, and follows an explicit-control design philosophy: developers can tune the runtime in detail for specific hardware and model characteristics, which addresses the hardware bottlenecks of MoE model deployment.

## Background: Hardware Bottlenecks in MoE Model Inference

Mixture of Experts (MoE) models are an important path for scaling the capabilities of large language models, but they pose distinct deployment challenges. Model weights are large and can far exceed the memory capacity of a single NPU or GPU. Expert routing is dynamic, which makes memory access patterns hard to predict. Traditional NPU-only inference solutions cannot hold weights that reach hundreds of GB. SurgeLLM addresses this pain point by dropping the assumption that "models must be fully loaded into NPU memory": through CPU/NPU collaborative computing, ultra-large MoE models can run on edge devices.
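The memory pressure described above can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes that "35B" in the model name denotes total parameters and "A3B" denotes the parameters activated per token, and it ignores quantization scales, KV cache, and activation buffers. The function name and the listed precisions are hypothetical and not taken from SurgeLLM.

```python
def weight_footprint_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GiB (weights only, no runtime buffers)."""
    return num_params * bytes_per_param / 2**30


TOTAL_PARAMS = 35e9   # assumption: "35B" = total parameter count
ACTIVE_PARAMS = 3e9   # assumption: "A3B" = parameters activated per token

# Hypothetical storage precisions, given as bytes per parameter.
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    total = weight_footprint_gib(TOTAL_PARAMS, bytes_per_param)
    active = weight_footprint_gib(ACTIVE_PARAMS, bytes_per_param)
    print(f"{label}: total ~{total:.1f} GiB, active per token ~{active:.1f} GiB")
```

Under these assumptions the full 16-bit weights come to about 65 GiB, while the weights touched by any single token are roughly a tenth of that. This gap is what makes it attractive to keep the full model in host memory and move only the active part to the device.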

## Technical Architecture: Hybrid Mode of Host Memory + NPU Computing

The core architectural idea of SurgeLLM is to keep the complete model weights in host memory and offload only selected computation paths and cached data to the NPU. Based on the routing decisions for the input tokens, the system dynamically loads the weights of the active experts onto the NPU for execution, so NPU memory does not have to hold the whole model. SurgeLLM uses a model adapter architecture that initially supports the Qwen3.6 MoE series and can later be extended to other MoE architectures without rewriting the core runtime.
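The routing-driven loading described above can be sketched as a top-k router feeding a small cache of device-resident experts. This is a minimal illustration, not SurgeLLM's implementation: `top_k_experts`, `DeviceExpertCache`, the least-recently-used eviction policy, and the softmax over the selected logits are all assumptions, and a dictionary assignment stands in for the host-to-NPU copy.

```python
import math
from collections import OrderedDict


def top_k_experts(router_logits, k):
    """Return the indices of the k highest-scoring experts and their gate weights.

    Gate weights are a softmax over the selected logits only (an assumption).
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


class DeviceExpertCache:
    """LRU cache standing in for expert weights resident on the NPU (hypothetical)."""

    def __init__(self, capacity, host_weights):
        self.capacity = capacity
        self.host_weights = host_weights   # full weight set stays in host memory
        self.resident = OrderedDict()      # expert_id -> weights "on device"
        self.uploads = 0                   # number of simulated host-to-device copies

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as most recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict the least recently used expert
        self.resident[expert_id] = self.host_weights[expert_id]  # simulated upload
        self.uploads += 1
        return self.resident[expert_id]


host = {i: f"expert-{i}-weights" for i in range(8)}
cache = DeviceExpertCache(capacity=3, host_weights=host)
ids, gates = top_k_experts([0.1, 2.0, 0.3, 1.5, 0.0, 0.0, 0.0, 0.0], k=2)
for expert_id in ids:
    cache.get(expert_id)
print(ids, [round(g, 3) for g in gates], cache.uploads)
```

If consecutive tokens route to the same experts, the cache serves them without a new upload; only a change in routing triggers a transfer from host memory.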

## Development Phases and Implementation Path

SurgeLLM development follows a progressive verification strategy. The first phase targets local text-only inference at batch size 1, establishing correctness on short contexts before optimizing for long contexts. The implementation first builds a pure CPU reference path to ensure logical correctness and numerical stability, then adds the Ascend 310P3 hybrid acceleration path. The project uses the CMake build system and provides a complete debugging toolchain covering model inspection, weight mapping, and runtime simulation, which lowers the barrier to adapting new models.
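One way a CPU reference path can be used is as a numerical oracle for the accelerated path. The sketch below shows such a check under stated assumptions: the function names and tolerance values are hypothetical, and the acceptance rule `|candidate - reference| <= atol + rtol * |reference|` follows the same convention as `numpy.allclose` rather than anything documented for SurgeLLM.

```python
import math


def outputs_match(reference, candidate, atol=1e-3, rtol=1e-2):
    """Compare two equal-length sequences of floats element by element.

    Returns False on a length mismatch, on any NaN or infinity, or when an
    element violates |candidate - reference| <= atol + rtol * |reference|.
    """
    if len(reference) != len(candidate):
        return False
    for ref, cand in zip(reference, candidate):
        if not (math.isfinite(ref) and math.isfinite(cand)):
            return False
        if abs(cand - ref) > atol + rtol * abs(ref):
            return False
    return True


def argmax(values):
    """Index of the largest value; useful for checking that the top token agrees."""
    return max(range(len(values)), key=lambda i: values[i])


cpu_logits = [0.12, 3.40, -1.05, 2.98]        # made-up reference output
npu_logits = [0.1201, 3.4003, -1.0497, 2.9795]  # made-up accelerated output
print(outputs_match(cpu_logits, npu_logits), argmax(cpu_logits) == argmax(npu_logits))
```

Checking both the elementwise tolerance and the agreement of the top-scoring token helps separate harmless rounding differences from errors that would change the generated text.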

## Developer Toolchain in Detail

SurgeLLM ships with Python helper tools that cover the entire lifecycle:

- `inspect_model` shows the standardized representation of the model configuration.
- The `inspect_weights` family of tools checks safetensors indices and metadata.
- `build_manifest` converts the model configuration into a runtime manifest.
- `query_weight` and `check_payload` verify that data is loaded correctly.
- `mock_runtime` and `mock_execute` simulate the inference process and memory layout without NPU hardware.
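To make the weight-inspection and manifest steps more concrete, the sketch below reads the header of a safetensors file and turns it into a flat list of tensor records. The safetensors layout used here (an 8-byte little-endian header length, followed by a JSON header with `dtype`, `shape`, and `data_offsets` per tensor) is the published file format. The function `manifest_from_bytes` and the record fields are hypothetical and do not reflect the actual output of `build_manifest` or `inspect_weights`.

```python
import json
import struct


def manifest_from_bytes(blob: bytes, shard_name: str) -> list:
    """Build a list of tensor records from the raw bytes of one safetensors shard."""
    (header_len,) = struct.unpack("<Q", blob[:8])      # 8-byte little-endian length
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    data_start = 8 + header_len                        # tensor data begins after the header
    records = []
    for name, info in header.items():
        if name == "__metadata__":                     # optional free-form metadata entry
            continue
        begin, end = info["data_offsets"]              # offsets relative to data_start
        records.append({
            "name": name,
            "shard": shard_name,
            "dtype": info["dtype"],
            "shape": info["shape"],
            "file_offset": data_start + begin,
            "nbytes": end - begin,
        })
    return sorted(records, key=lambda r: r["file_offset"])


# Build a tiny in-memory shard so the example runs without any model files.
demo_header = {
    "__metadata__": {"format": "pt"},
    "layer.0.weight": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]},
    "layer.0.bias": {"dtype": "F32", "shape": [1], "data_offsets": [8, 12]},
}
header_bytes = json.dumps(demo_header).encode("utf-8")
demo_blob = struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(12)

for record in manifest_from_bytes(demo_blob, "demo.safetensors"):
    print(record)
```

Because the header alone gives every tensor's byte range, a runtime can map or read individual expert weights from host storage without loading a whole shard.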

## Application Scenarios and Value

SurgeLLM suits enterprise applications that deploy ultra-large MoE models on edge NPUs, AI services that need stronger models under tight memory budgets, and large-scale deployments that are sensitive to hardware cost. For the Ascend ecosystem, it demonstrates that Chinese domestic NPUs are feasible for large MoE inference and provides a reference implementation for the domestic AI infrastructure software ecosystem.

## Technical Challenges and Future Outlook

CPU/NPU hybrid inference faces challenges such as memory bandwidth bottlenecks, expert-switching latency, and the complexity of cross-platform porting. Future directions include smarter expert caching algorithms to reduce transfer overhead, support for larger batch sizes to improve throughput, memory optimization for long contexts, and support for more MoE model families. Hybrid inference runtimes are expected to play a growing role in edge deployment of large models.
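The link between expert caching and transfer overhead can be illustrated with a back-of-envelope model. Every number below is hypothetical: the expert size, the number of expert activations per token, and the host-to-device bandwidth are placeholders, not measurements of SurgeLLM or the Ascend 310P3. The function name is also an assumption.

```python
def expected_transfer_ms(activations_per_token, expert_bytes, hit_rate, bandwidth_gb_s):
    """Expected host-to-device transfer time per token, in milliseconds.

    activations_per_token: expert activations summed over all MoE layers.
    hit_rate: fraction of activations already resident on the device (0.0 to 1.0).
    bandwidth_gb_s: sustained transfer bandwidth in GB/s (1 GB = 1e9 bytes).
    """
    missed = activations_per_token * (1.0 - hit_rate)
    return missed * expert_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Placeholder values chosen only to show how the cost scales with hit rate.
for hit_rate in (0.0, 0.5, 0.9):
    cost = expected_transfer_ms(
        activations_per_token=100,
        expert_bytes=5e6,
        hit_rate=hit_rate,
        bandwidth_gb_s=10.0,
    )
    print(f"hit rate {hit_rate:.0%}: ~{cost:.1f} ms per token")
```

In this simplified model the transfer cost falls linearly as the hit rate rises, which is why better caching and larger batches that reuse the same experts both appear among the future directions.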
