NanoDeploy: A High-Performance Large Model Inference Engine for Production Environments

DeepLink's open-source LLM inference engine delivers high-throughput, low-latency serving for large-scale model deployment through Prefill-Decode separation, wide expert parallelism, and an EPD architecture, and supports mainstream models such as DeepSeek, Qwen, and Kimi.

Tags: Large Model Inference · LLM Deployment · Prefill-Decode Separation · Expert Parallelism · MoE · DeepSeek · Qwen · High-Performance Computing · RDMA · Inference Optimization
Published 2026-05-12 18:06 · Recent activity 2026-05-12 18:23 · Estimated read: 6 min

Section 01

NanoDeploy: Introduction to the High-Performance Large Model Inference Engine for Production Environments

NanoDeploy is an open-source LLM inference engine developed by the DeepLink team, designed to meet the high concurrency demands of production environments. Through innovative architectures and optimization techniques such as Prefill-Decode separation and wide expert parallelism, it achieves high throughput and low latency, supports mainstream models like DeepSeek, Qwen, and Kimi, and provides an efficient solution for large-scale model service deployment.


Section 02

R&D Background and Technical Positioning of NanoDeploy

With the widespread application of LLMs across various industries, efficient and stable inference services in high-concurrency scenarios have become a core challenge for AI infrastructure. NanoDeploy is positioned as a high-performance inference engine for production environments, with core design principles of decoupling and parallelism. It decomposes the end-to-end inference process into independently scalable components, improving resource utilization efficiency and cluster scheduling flexibility.


Section 03

Core Architecture Components of NanoDeploy

NanoDeploy adopts a microservices architecture, consisting of four core components:

  1. NanoRoute: An intelligent traffic gateway written in Rust, providing OpenAI-compatible APIs, responsible for request distribution and advanced feature support;
  2. NanoCtrl: A service governance center implemented in Rust, managing engine node registration, monitoring, and lifecycle based on Redis (a registration sketch follows this list);
  3. Inference Execution Engine: Implemented in Python/C++, supporting separate deployment of Prefill/Decode, responsible for inference computation and distributed management;
  4. NanoDeployVL: A vision-language encoder that supports EP-separated ViT and RDMA transmission, adapting to multimodal models.
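
NanoCtrl itself is written in Rust, so the following is only a minimal Python sketch of the heartbeat-style registration pattern its Redis-based design implies: an engine node publishes its metadata under a key and keeps refreshing a TTL so the controller can detect and evict dead engines. The key layout, field names, endpoint, and TTL are illustrative assumptions, not NanoCtrl's actual schema.

```python
# Minimal sketch of heartbeat-style node registration over Redis, in the
# spirit of NanoCtrl's design. Key names, fields, and TTL are assumptions.
import time

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

NODE_KEY = "nanodeploy:nodes:prefill-0"  # hypothetical key scheme


def register_and_heartbeat(ttl_s: int = 10) -> None:
    """Publish node metadata and refresh a TTL; a stopped heartbeat lets
    the controller notice the node is gone once the key expires."""
    while True:
        r.hset(NODE_KEY, mapping={
            "role": "prefill",
            "endpoint": "10.0.0.5:9000",  # hypothetical engine address
            "last_seen": str(time.time()),
        })
        r.expire(NODE_KEY, ttl_s)
        time.sleep(ttl_s / 2)
```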

Section 04

Innovative Technical Design: Separation Architecture and Wide Expert Parallelism

  1. Prefill-Decode Separation: Separates compute-intensive prompt processing (Prefill) and memory-intensive token generation (Decode) onto different GPU nodes, migrates the KV Cache via RDMA, and allocates resources according to the characteristics of each phase (a toy control-flow sketch follows this list);
  2. Wide Expert Parallelism: For MoE models, distributes experts across all GPUs while maintaining data parallelism in attention layers, achieving load balancing, high scalability, and communication optimization.
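
To make the division of labor concrete, here is a runnable toy of the Prefill-Decode control flow. The "model" here is a trivial stand-in, and transfer() stands in for the RDMA KV Cache migration; none of these names are NanoDeploy's actual API.

```python
# Toy Prefill-Decode separation: prefill builds the KV cache in one pass,
# the cache is shipped to a decode worker (NanoDeploy uses RDMA), and decode
# extends it one token at a time. The "model" is a stand-in hash function.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    entries: list = field(default_factory=list)  # one entry per processed token


def prefill(prompt: list[int]) -> KVCache:
    # Compute-intensive phase: process the whole prompt in one batch.
    return KVCache(entries=[t * 31 % 97 for t in prompt])


def transfer(kv: KVCache) -> KVCache:
    # Stand-in for the RDMA KV Cache migration between GPU nodes.
    return KVCache(entries=list(kv.entries))


def decode(kv: KVCache, max_new_tokens: int) -> list[int]:
    # Memory-intensive phase: one token per step, each step re-reading
    # (and growing) the KV cache.
    out = []
    for _ in range(max_new_tokens):
        tok = sum(kv.entries) % 97  # toy next-token rule
        kv.entries.append(tok)
        out.append(tok)
    return out


print(decode(transfer(prefill([1, 2, 3])), max_new_tokens=4))
```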

Section 05

Key Optimization Features to Enhance Inference Performance

  • Continuous Batching and Dynamic Scheduling: Dynamically adds requests and combines paged KV Cache to improve GPU utilization;
  • FP8 KV Cache: Reduces cache usage by approximately 50% and supports longer sequences (see the sizing example after this list);
  • Prefix Cache: Reuses KV Cache of shared prompts to avoid redundant computation;
  • Multi-Token Prediction: Accelerates generation via speculative decoding;
  • Native Sparse Attention: Efficiently handles sparse patterns and reduces overhead for long sequences.
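
The roughly 50% figure for FP8 KV Cache follows directly from element width: 1 byte per element instead of 2 for FP16/BF16. A back-of-the-envelope sizing for a generic dense transformer, with purely illustrative model dimensions (not NanoDeploy defaults or any specific model's configuration):

```python
# KV cache sizing for a generic dense transformer. All dimensions are
# illustrative assumptions chosen only to show the FP16 -> FP8 ratio.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    # 2x for storing both the K and the V tensor at every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

layers, kv_heads, head_dim, seq_len = 61, 8, 128, 32_768  # hypothetical

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 2)
fp8 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, 1)

print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB per sequence")
print(f"FP8  KV cache: {fp8 / 2**30:.1f} GiB per sequence (~50% smaller)")
```

At these assumed dimensions, a 32K-token sequence needs about 7.6 GiB of KV Cache in FP16 versus about 3.8 GiB in FP8, which is where the headroom for longer sequences comes from.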

Section 06

Model Ecosystem and High-Performance Kernel Support

  1. Model Ecosystem: Adapts to mainstream models such as DeepSeek-V3/V3.2/V4, GLM-5, Kimi-K2, and Qwen3 series, covering dense and MoE architectures;
  2. Performance Kernels: Integrates high-performance libraries like DeepEP, DeepGEMM, FlashMLA, FlashInfer, and DLSlime, and fully leverages the capabilities of Hopper architecture GPUs.

Section 07

Deployment Modes and Industry Impact

Deployment modes include non-separated (small to medium scale), separated (large scale and high concurrency), and HTTP service (OpenAI-compatible API). NanoDeploy represents the latest direction in inference infrastructure: its open-source technology drives efficiency improvements across the industry, gives enterprises a fully functional open-source option, and its modular design eases secondary development and customization.
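
Because the HTTP service mode exposes an OpenAI-compatible API, the standard openai Python client should work against it unmodified. The base URL, port, API key, and model id below are placeholders whose real values depend on how the service was launched:

```python
# Querying a NanoDeploy HTTP endpoint through the standard openai client.
# base_url, api_key, and model are placeholders, not NanoDeploy defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-v3",  # placeholder model id
    messages=[{"role": "user",
               "content": "Explain Prefill-Decode separation in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```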