SurgeLLM: CPU/NPU Hybrid Inference Runtime Unlocks Ultra-Large Sparse MoE Models

SurgeLLM enables inference of large sparse MoE language models beyond the memory limits of NPUs by coordinating host memory, CPU computation, and NPU acceleration. Its first target is Qwen3.6-35B-A3B on the Ascend 310P3.

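MoE · sparse expert models · NPU inference · Ascend · hybrid computing · edge deployment · Qwen · large model inference · CPU/NPU collaboration · model quantization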

Published 2026-05-19 17:44 · Recent activity 2026-05-19 17:50 · Estimated read 6 min

Section 01

Introduction: SurgeLLM Unlocks CPU/NPU Hybrid Inference for Ultra-Large Sparse MoE Models

SurgeLLM is a CPU/NPU hybrid inference runtime written in C++. It is designed to work around NPU memory limits so that very large sparse MoE language models (e.g., Qwen3.6-35B-A3B) can run efficiently on edge NPU devices such as the Ascend 310P3. It coordinates host memory, CPU computation, and NPU acceleration, and it follows an explicit-control design philosophy: developers can tune the runtime at a fine-grained level for specific hardware and model characteristics, which addresses the hardware bottlenecks of MoE model deployment.

Section 02

Background: Hardware Bottlenecks in MoE Model Inference

Mixture of Experts (MoE) models are an important technical path for scaling the capabilities of large language models, but they face three deployment challenges of their own. First, the model weights are large and far exceed the memory capacity of a single NPU or GPU. Second, expert routing is dynamic, which makes memory access patterns hard to predict. Third, traditional NPU-only inference solutions cannot handle weights that reach hundreds of GB. SurgeLLM addresses this pain point by dropping the assumption that "models must be fully loaded into NPU memory", so that ultra-large MoE models can run on edge devices through CPU-NPU collaborative computing.
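The gap between total weights and per-token work can be made concrete with some rough arithmetic. The sketch below assumes the model name follows the usual Qwen convention, with "35B" read as roughly 35 billion total parameters and "A3B" as roughly 3 billion activated per token. The device memory figure is a hypothetical placeholder, not a specification of the Ascend 310P3, and the estimate ignores KV cache, activations, and quantization metadata.

```python
# Rough weight-size arithmetic for a sparse MoE model.
# Assumptions (not from the article): parameter counts are read off the
# model name, and DEVICE_BYTES is a made-up device budget for illustration.

TOTAL_PARAMS = 35e9    # "35B": all experts plus shared layers
ACTIVE_PARAMS = 3e9    # "A3B": parameters actually used per token
DEVICE_BYTES = 24e9    # hypothetical NPU memory budget


def weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Bytes needed to store num_params weights at the given precision."""
    return num_params * bits_per_param / 8


def fits_on_device(num_params: float, bits_per_param: int,
                   device_bytes: float = DEVICE_BYTES) -> bool:
    """True if the weights alone fit in the given device budget."""
    return weight_bytes(num_params, bits_per_param) <= device_bytes


if __name__ == "__main__":
    for bits in (16, 8, 4):
        total_gb = weight_bytes(TOTAL_PARAMS, bits) / 1e9
        active_gb = weight_bytes(ACTIVE_PARAMS, bits) / 1e9
        print(f"{bits:>2}-bit: total {total_gb:.1f} GB, active per token {active_gb:.1f} GB")
```

Under these assumptions the full weight set is more than ten times larger than the slice touched for any one token, which is the asymmetry a hybrid runtime can exploit.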

Section 03

Technical Architecture: Hybrid Mode of Host Memory + NPU Computing

The core architectural idea in SurgeLLM is to keep the complete model weights in host memory and offload only selected computation paths and cached data to the NPU. Based on the routing decisions for the input tokens, the system dynamically loads the weights of the active experts onto the NPU for execution, so the NPU never has to hold the full model. A model adapter architecture isolates model-specific logic: it initially supports the Qwen3.6 MoE series and can later be extended to other MoE architectures without rewriting the core runtime.
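A minimal sketch of this idea follows, written in Python for brevity even though the runtime itself is C++. It is not SurgeLLM's implementation: `ExpertCache`, `top_k_experts`, and the least-recently-used eviction policy are all hypothetical names and choices made for illustration, and the "upload" is simulated by copying a reference between dictionaries.

```python
from collections import OrderedDict


class ExpertCache:
    """Hypothetical LRU cache standing in for NPU-resident expert weights."""

    def __init__(self, host_weights, capacity):
        self.host_weights = host_weights   # every expert, kept in host memory
        self.capacity = capacity           # max experts resident on the device
        self.resident = OrderedDict()      # expert_id -> simulated device copy
        self.uploads = 0                   # number of host-to-device transfers

    def get(self, expert_id):
        """Return the device copy of an expert, uploading it on a miss."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as most recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict least recently used
        self.resident[expert_id] = self.host_weights[expert_id]  # simulated upload
        self.uploads += 1
        return self.resident[expert_id]


def top_k_experts(router_scores, k):
    """Indices of the k highest router scores, best first."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    host = {i: f"weights-{i}" for i in range(4)}
    cache = ExpertCache(host, capacity=2)
    for scores in ([0.1, 0.9, 0.3, 0.8], [0.2, 0.7, 0.1, 0.9]):
        for expert_id in top_k_experts(scores, k=2):
            cache.get(expert_id)
    print("uploads:", cache.uploads)
```

In the demo both tokens route to the same two experts, so only the first token pays for uploads. When routing is less repetitive, the cache capacity and eviction policy determine how much host-to-device traffic the runtime incurs.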

Section 04

Development Phases and Implementation Path

SurgeLLM development follows a progressive verification strategy. The first phase focuses on local text-only inference at batch size 1, and it prioritizes correctness on short contexts before optimizing for long contexts. On the implementation side, a pure CPU reference path is built first to establish logical correctness and numerical stability, and the Ascend 310P3 hybrid acceleration path is added afterwards. The project uses a CMake build system and provides a debugging toolchain that covers model inspection, weight mapping, and runtime simulation, which lowers the barrier to adapting new models.
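One common way to use a reference path is to compare its outputs element by element against the accelerated path. The sketch below shows that kind of check under stated assumptions: `compare_outputs` is a hypothetical helper, the tolerances are arbitrary illustrative values, and the article does not describe how SurgeLLM actually validates its hybrid path.

```python
import math


def compare_outputs(reference, candidate, atol=1e-3, rtol=1e-3):
    """Compare two equal-length sequences of floats.

    Returns (all_close, max_abs_diff). The tolerances are illustrative
    defaults, not values taken from SurgeLLM.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    all_close = True
    max_abs_diff = 0.0
    for ref, cand in zip(reference, candidate):
        max_abs_diff = max(max_abs_diff, abs(ref - cand))
        if not math.isclose(ref, cand, rel_tol=rtol, abs_tol=atol):
            all_close = False
    return all_close, max_abs_diff


if __name__ == "__main__":
    cpu_logits = [0.12, -1.50, 3.25]          # pretend CPU reference output
    hybrid_logits = [0.1201, -1.5004, 3.2497] # pretend hybrid-path output
    ok, worst = compare_outputs(cpu_logits, hybrid_logits)
    print("match:", ok, "max abs diff:", worst)
```

Starting with short contexts keeps such comparisons cheap and makes a mismatch easier to trace back to a specific layer or expert.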

Section 05

The Developer Toolchain in Detail

SurgeLLM ships with Python auxiliary tools that cover the whole adaptation workflow: inspect_model shows the normalized representation of a model configuration; the inspect_weights family of tools checks safetensors indices and metadata; build_manifest converts a model configuration into a runtime manifest; query_weight and check_payload verify that data is loaded correctly; and mock_runtime and mock_execute simulate the inference process and memory layout without NPU hardware.
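To give a feel for what inspecting a safetensors index involves, the sketch below groups tensor names by shard file. It relies only on the index layout used by Hugging Face sharded checkpoints, a JSON file with a `weight_map` object that maps tensor names to shard files. It is not the code of the inspect_weights tools, `tensors_by_shard` is a hypothetical function, and the tensor and file names in the sample are invented.

```python
import json
from collections import defaultdict


def tensors_by_shard(index_json: str) -> dict:
    """Group tensor names by the shard file that stores them.

    Expects the text of a sharded-checkpoint index, i.e. JSON containing a
    "weight_map" object of {tensor_name: shard_file}.
    """
    index = json.loads(index_json)
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in shards.items()}


# Invented sample index; real tensor names depend on the model.
SAMPLE_INDEX = json.dumps({
    "metadata": {"total_size": 123},
    "weight_map": {
        "layers.0.experts.1.up_proj.weight": "model-00001-of-00002.safetensors",
        "layers.0.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
        "layers.1.router.weight": "model-00002-of-00002.safetensors",
    },
})

if __name__ == "__main__":
    for shard, names in tensors_by_shard(SAMPLE_INDEX).items():
        print(shard, len(names), "tensors")
```

A mapping like this lets a loader read one expert's tensors from disk into host memory without scanning every shard.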

Section 06

Application Scenarios and Value

SurgeLLM suits three kinds of deployment: enterprise applications that run ultra-large MoE models on edge NPUs, AI services that have limited device memory but need more capable models, and large-scale deployments that are sensitive to hardware cost. For the Ascend ecosystem, it demonstrates that domestically developed Chinese NPUs are a feasible target for large MoE inference, and it provides a reference implementation for China's AI infrastructure software ecosystem.

Section 07

Technical Challenges and Future Outlook

CPU/NPU hybrid inference faces challenges such as memory bandwidth bottlenecks, expert switching latency, and the complexity of porting across platforms. Future directions include smarter expert caching algorithms that reduce transfer overhead, support for larger batch sizes to improve throughput, memory optimization for long contexts, and support for more MoE model families. Hybrid inference runtimes are likely to play a growing role in the edge deployment of large models.
