Zing Forum

DLEngine: Architecture Analysis of an LLM Inference Engine for Production Environments

DLEngine is an open-source, high-performance LLM inference engine built on a Prefill-Decode disaggregation architecture and Wide Expert Parallelism. It supports mainstream models such as DeepSeek-V3/V4, Qwen3, and Kimi-K2, and targets low-latency, high-throughput inference serving.

Tags: LLM inference, large model deployment, Prefill-Decode disaggregation, MoE, DeepSeek, Qwen, vLLM alternative
Published 2026-06-13 01:15 | Recent activity 2026-06-13 01:24 | Estimated read 7 min

Section 02

Original Author and Source

  • Original Author/Maintainer: DeepLink-org
  • Source Platform: GitHub
  • Original Title: DLEngine: LLM Inference with Prefill-Decode Disaggregation and Wide Expert Parallelism
  • Original Link: https://github.com/DeepLink-org/DLEngine
  • Publication Date: June 12, 2026

Section 03

Project Background and Positioning

As the parameter counts of large language models (LLMs) keep growing, optimizing inference serving has become a core challenge for AI infrastructure. Traditional single-node inference solutions often struggle with long contexts and high concurrency. DLEngine is an open-source, high-performance LLM inference engine developed by the DeepLink-org team and designed for production environments. It aims to balance low latency and high throughput by separating the inference stages and parallelizing MoE layers across many GPUs.

The project is not a thin wrapper around vLLM or TensorRT-LLM; it redesigns the inference pipeline from the ground up. Its two core features are the Prefill-Decode disaggregation architecture and the Wide Expert Parallelism strategy, which make it particularly well suited to MoE (Mixture of Experts) models.

Section 04

Prefill-Decode Disaggregation Architecture

Traditional LLM inference runs prompt processing and token generation in the same process, so the two block each other. DLEngine splits inference into three independent stages:

  1. Encoder Stage: Handles multi-modal inputs (e.g., image encoding)
  2. Prefill Stage: Computes the KV Cache for prompts, which is compute-intensive
  3. Decode Stage: Autoregressively generates tokens, which is memory-bound (limited by memory bandwidth rather than compute)

Disaggregation allows each stage to be optimized separately: the Prefill engine can process long prompts in batches, while the Decode engine focuses on low-latency generation. The Prefill and Decode engines transfer the KV Cache via GPUDirect RDMA, which avoids staging it through CPU memory.
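
The Prefill/Decode split can be illustrated with a toy pipeline. This is only a minimal sketch under stated assumptions, not DLEngine's implementation: the names `Request`, `prefill`, `decode_step`, and `run_pipeline` are hypothetical, the "model" is a trivial arithmetic function, and an in-process queue stands in for the GPUDirect RDMA transfer.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class Request:
    request_id: int
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)       # toy stand-in: one entry per token
    output_tokens: list = field(default_factory=list)


def prefill(req: Request) -> Request:
    """Prefill stage: process the whole prompt at once and build the KV cache."""
    req.kv_cache = list(req.prompt_tokens)
    return req


def decode_step(req: Request) -> bool:
    """Decode stage: emit one token from the cached state; True when finished."""
    next_token = sum(req.kv_cache) % 100   # toy "model", deterministic on purpose
    req.output_tokens.append(next_token)
    req.kv_cache.append(next_token)        # the cache grows by one entry per token
    return len(req.output_tokens) >= req.max_new_tokens


def run_pipeline(requests):
    """Run prefill first, hand the cache over, then decode all requests together."""
    transfer_queue = Queue()               # hypothetical stand-in for the KV transfer link
    for req in requests:
        transfer_queue.put(prefill(req))

    active = []
    while not transfer_queue.empty():
        active.append(transfer_queue.get())

    finished = []
    while active:
        still_running = []
        for req in active:
            (finished if decode_step(req) else still_running).append(req)
        active = still_running
    return finished
```

In a real deployment the two stages would run on separate engines, so a long prompt in prefill would not stall the decode loop; the sketch only shows the hand-off of cached state between them.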

Section 05

Wide Expert Parallelism

For MoE models (e.g., DeepSeek-V3), DLEngine combines two forms of parallelism:

  • Attention Data Parallelism: Attention weights are replicated across GPUs, and each replica computes attention for its own share of the requests
  • FFN Expert Parallelism: Expert networks are distributed across different GPUs, enabling flexible scaling through the combination of attention_dp × ffn_ep

This design allows full utilization of the FFN computing power of multiple GPUs while maintaining low latency in the attention layer.
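
To make the attention_dp × ffn_ep layout concrete, here is a minimal sketch of how requests and experts might be mapped to ranks. The placement rules are assumptions chosen for illustration (round-robin requests, contiguous equal-sized expert shards), and the function names are hypothetical; the source does not describe DLEngine's actual placement policy.

```python
def attention_replica(request_id: int, attention_dp: int) -> int:
    """Pick the attention replica for a request (assumed round-robin)."""
    return request_id % attention_dp


def expert_owner(expert_id: int, num_experts: int, ffn_ep: int) -> int:
    """Return the EP rank that holds an expert (assumed contiguous shards)."""
    if num_experts % ffn_ep != 0:
        raise ValueError("num_experts must be divisible by ffn_ep")
    experts_per_rank = num_experts // ffn_ep
    return expert_id // experts_per_rank


def dispatch(token_experts: dict, num_experts: int, ffn_ep: int) -> dict:
    """Group (token_id, expert_id) pairs by the EP rank that must compute them."""
    per_rank = {rank: [] for rank in range(ffn_ep)}
    for token_id, experts in token_experts.items():
        for expert_id in experts:
            rank = expert_owner(expert_id, num_experts, ffn_ep)
            per_rank[rank].append((token_id, expert_id))
    return per_rank


# Illustrative sizes: 256 experts over 32 EP ranks gives 8 experts per rank.
routing = dispatch({0: [0, 17], 1: [255, 16]}, num_experts=256, ffn_ep=32)
busy_ranks = {rank: pairs for rank, pairs in routing.items() if pairs}
```

Each token's hidden state is sent to the ranks that own its selected experts and the results are gathered back afterwards, which is why the width of the expert-parallel group can be scaled independently of the number of attention replicas.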

Section 06

Memory Optimization Techniques

Technique | Description | Effect
FP8 KV Cache | Paged KV Cache in Float8 (E4M3) format | Reduces KV Cache memory by approximately 50% relative to a 16-bit cache
MLA (Multi-head Latent Attention) | Low-rank KV compression for the DeepSeek series | Significantly reduces KV Cache size
GDN (Gated Delta Net) | Linear attention mechanism for Qwen3.5-MoE | Efficient computation for hybrid full-attention/linear-attention layers
Prefix Caching | Reuse of KV Cache for shared prompt prefixes | Significantly accelerates repeated queries
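
The roughly 50% figure for the FP8 KV Cache follows from simple arithmetic, sketched below. The model dimensions are hypothetical, the formula assumes a conventional multi-head or grouped-query layout that stores full K and V tensors (it does not apply to MLA's compressed cache), and any extra storage for FP8 scaling factors is ignored.

```python
import math


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes needed to cache K and V for num_tokens tokens across all layers."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element  # 2 = K and V
    return per_token * num_tokens


def pages_needed(num_tokens, page_size):
    """Number of fixed-size pages a paged KV cache allocates for one sequence."""
    return math.ceil(num_tokens / page_size)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 4096 cached tokens.
fp16_bytes = kv_cache_bytes(4096, 32, 8, 128, bytes_per_element=2)
fp8_bytes = kv_cache_bytes(4096, 32, 8, 128, bytes_per_element=1)

print(f"FP16: {fp16_bytes / 2**20:.0f} MiB, FP8: {fp8_bytes / 2**20:.0f} MiB")
```

Halving the bytes per element halves the cache, so the same GPU memory can hold about twice as many cached tokens, which in turn allows larger decode batches or longer contexts.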

Section 07

Inference Acceleration Techniques

  • Continuous Batching: Dynamic request scheduling, combined with paged KV Cache for efficient batching
  • CUDA Graph: Captures the decode kernels into a graph and replays it, removing per-kernel launch and Python overhead and reducing token generation latency
  • Chunked Prefill: Splits long prompts into chunks and interleaves them with decode batches
  • Multi-Token Prediction (MTP): Uses the model's native MTP head for speculative decoding
  • Native Sparse Attention (NSA): FP8 sparse decoding for DeepSeek-V3.2 with block-level indexing
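
Continuous batching and chunked prefill can be sketched together as a scheduler with a per-step token budget. This is a simplified illustration under stated assumptions, not DLEngine's scheduler: `plan_steps` and its parameters are hypothetical, decoding requests never finish within the plan, and prompts are served in FIFO order.

```python
from collections import deque


def plan_steps(prefill_lens, decoding, token_budget, max_steps=100):
    """Plan batches: each decoding request gets 1 token per step, and the
    leftover budget is filled with chunks of pending prompts in FIFO order."""
    if token_budget <= len(decoding):
        raise ValueError("token_budget must leave room for prefill chunks")
    pending = deque(prefill_lens.items())   # (request_id, prompt tokens still to prefill)
    decoding = list(decoding)
    steps = []
    while pending and len(steps) < max_steps:
        batch = [("decode", rid, 1) for rid in decoding]
        budget = token_budget - len(decoding)
        newly_prefilled = []
        while pending and budget > 0:
            rid, remaining = pending.popleft()
            chunk = min(remaining, budget)
            batch.append(("prefill", rid, chunk))
            budget -= chunk
            if remaining > chunk:
                pending.appendleft((rid, remaining - chunk))   # rest waits for the next step
            else:
                newly_prefilled.append(rid)
        decoding.extend(newly_prefilled)     # these start decoding in the next step
        steps.append(batch)
    return steps


steps = plan_steps({"A": 10, "B": 3}, decoding=["X", "Y"], token_budget=6)
for i, batch in enumerate(steps):
    print(i, batch)
```

In this plan the 10-token prompt "A" is spread over three steps, so requests "X" and "Y" keep producing one token per step instead of waiting for the whole prompt to be prefilled.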

Section 08

Multi-modal Support

DLEngine supports vision-language models such as Qwen3-VL through the dlengine.vl subpackage. The Vision Encoder runs as an independent component and transfers image embeddings to the Prefill engine via RDMA.