Zing Forum


nanoPD: A Prefill/Decode-Separated LLM Inference Engine Built from Scratch

A complete prefill/decode-separated inference system that tackles resource contention in LLM inference through a custom paged KV cache, CUDA kernels, multi-GPU KV transfer, and adaptive routing.

Tags: LLM inference · separated inference · Prefill-Decode · paged attention · CUDA kernels · multi-GPU · adaptive routing · KV cache
Posted 2026/04/11 14:12 · Last activity 2026/04/11 14:17 · Estimated reading time: 8 minutes

Section 01

nanoPD: A Complete Prefill-Decode Separated LLM Inference Engine

nanoPD is a fully implemented Prefill-Decode separated LLM inference engine built from scratch. It addresses resource competition issues in LLM inference through custom paged KV cache, CUDA kernels, multi-GPU KV transfer, and adaptive routing. This thread will break down its background, architecture, core technologies, cost model, performance benchmarks, and practical implications.

Section 02

Background: Bottlenecks in Traditional LLM Inference & Rise of Separated Architecture

LLM inference consists of two distinct stages: compute-bound Prefill (processing the input prompt) and memory-bandwidth-bound Decode (generating tokens one at a time). Traditional deployments run both stages on the same GPU, so they interfere with each other and leave resources underutilized. The nanoPD project implements a complete separated inference system and serves as an excellent learning example for modern LLM serving architectures.

Section 03

Architecture Design & Core Technical Innovations

Architecture Design

nanoPD's architecture has three layers:

  1. CentralScheduler: Manages task distribution, KV transfer coordination, and path cost calculation.
  2. Worker Nodes: Collocated Worker (handles both stages on one GPU), Prefill Worker (specialized for Prefill), Decode Worker (specialized for Decode).
  3. Router: Chooses optimal execution paths using an analytical cost model.
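As a rough sketch of how the three layers could interact; the class and method names below are illustrative, not nanoPD's actual API:

```python
# Illustrative sketch of the three-layer flow; names are assumptions,
# not nanoPD's real interfaces.
class Router:
    """Chooses an execution path from the analytical cost model."""
    def __init__(self, gamma, transfer_rate, batch_thresh):
        self.gamma = gamma                  # interference cost (ms/token)
        self.transfer_rate = transfer_rate  # KV transfer cost (ms/token)
        self.batch_thresh = batch_thresh

    def choose(self, prompt_len, load):
        # Separated pays a KV-transfer cost; collocated pays an
        # interference cost that grows with system load.
        transfer_cost = self.transfer_rate * prompt_len
        interference_cost = self.gamma * prompt_len * (load / self.batch_thresh)
        return "separated" if transfer_cost < interference_cost else "collocated"

class CentralScheduler:
    """Distributes tasks and coordinates KV transfer between workers."""
    def __init__(self, router, workers):
        self.router = router
        self.workers = workers  # keys: "collocated", "prefill", "decode"

    def submit(self, prompt_len, load):
        path = self.router.choose(prompt_len, load)
        if path == "collocated":
            return [self.workers["collocated"]]
        # Separated path: prefill on one worker, then the KV cache
        # migrates to a decode worker.
        return [self.workers["prefill"], self.workers["decode"]]
```

The real scheduler also has to track in-flight KV transfers and worker queues; this sketch only shows the routing decision and dispatch.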

Core Innovations

  • Paged KV Cache: Block-based memory management with Copy-on-Write for beam search/speculative decoding, improving memory efficiency.
  • Chunked Prefill: Splits long prompts into configurable chunks, interleaving with Decode steps to keep GPU utilization high.
  • Custom CUDA Paged Attention Kernel: Handwritten CUDA code for gather-scatter attention over non-contiguous KV blocks.
  • Async KV Transfer: Uses dedicated CUDA streams for KV migration (pinned-memory relay or P2P/NVLink) so transfers overlap with computation.
  • Adaptive Router: Makes decisions based on hardware-fitted cost models (no offline training) and an online Bayesian output length predictor.
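The paged KV cache idea can be sketched with a tiny reference-counting block allocator. `BlockAllocator` and its methods are illustrative names, and a real engine would also copy the underlying KV tensors when a shared block is written:

```python
# Minimal sketch of paged KV allocation with copy-on-write, in the
# spirit of nanoPD's BlockSpaceManager; names and block size are
# illustrative, and actual KV tensor copies are omitted.
class BlockAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size          # tokens per KV block
        self.free = list(range(num_blocks))   # free physical block ids
        self.refcount = {}                    # block id -> reference count

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block):
        # Beam search / speculative decoding: share the block by
        # bumping its refcount instead of copying it.
        self.refcount[block] += 1
        return block

    def write(self, block):
        # Copy-on-write: a shared block is copied before mutation.
        if self.refcount[block] == 1:
            return block              # sole owner, write in place
        self.refcount[block] -= 1
        new_block = self.allocate()   # a real engine copies KV data here
        return new_block

    def free_block(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)
```

Sharing via `fork` is what makes beam search cheap: divergent beams only pay for the blocks they actually mutate.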

Section 04

Cost Model: Mathematical Basis for Routing Decisions

The router uses hardware-measured parameters to estimate end-to-end latency for collocated and separated strategies:

| Parameter | Meaning | RTX4090×8 | H20 |
|---|---|---|---|
| α | Prefill latency (ms/token) | 0.1247 | 0.1452 |
| β | Decode step latency (ms, batch size = 1) | 51.56 | 33.10 |
| batch_thresh | Batch size at the memory/compute crossover | 16 | 16 |
| γ | Prefill interference on Decode (ms/token) | 0.0869 | 0.1302 |
| bandwidth | Inter-GPU transfer bandwidth (GB/s) | 12.9 | 392 |

Key Decision Logic:

  • Separated strategy cost: transfer_rate × L, the KV-transfer cost, where transfer_rate is in ms/token and L is the prompt length in tokens.
  • Collocated strategy cost: γ × L × (load / batch_thresh), the interference cost under the current system load.
  • Comparing the two: separated is better when transfer_rate × L < γ × L × (load / batch_thresh), which simplifies to γ/transfer_rate > batch_thresh/load.

Examples:

  • RTX4090: γ/transfer_rate ≈ 7.6 → separated is better once system load ≥ 3.
  • H20: γ/transfer_rate ≈ 346 → separated is better at almost any non-zero load.
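The two thresholds above can be reproduced from the table's parameters, assuming fp16 KV for Qwen3-8B with 36 layers, 8 KV heads, and head dimension 128 (these model dimensions are my assumption, not stated in the post):

```python
# Back-of-the-envelope check of the routing thresholds. The per-token
# KV size below assumes fp16 KV for Qwen3-8B (36 layers, 8 KV heads,
# head dim 128); these model dimensions are an assumption.
KV_BYTES_PER_TOKEN = 36 * 8 * 128 * 2 * 2  # layers * kv_heads * head_dim * (K,V) * fp16

def transfer_rate_ms_per_token(bandwidth_gb_s):
    """KV transfer cost in ms per token at a given bandwidth."""
    return KV_BYTES_PER_TOKEN / (bandwidth_gb_s * 1e9) * 1e3

def min_load_for_separated(gamma, bandwidth_gb_s, batch_thresh=16):
    # Separated wins when gamma/transfer_rate > batch_thresh/load,
    # i.e. when load > batch_thresh * transfer_rate / gamma.
    return batch_thresh * transfer_rate_ms_per_token(bandwidth_gb_s) / gamma

ratio_4090 = 0.0869 / transfer_rate_ms_per_token(12.9)   # ≈ 7.6
ratio_h20 = 0.1302 / transfer_rate_ms_per_token(392)     # ≈ 346
load_4090 = min_load_for_separated(0.0869, 12.9)         # ≈ 2.1, so load ≥ 3
load_h20 = min_load_for_separated(0.1302, 392)           # ≈ 0.05, any real load
```

That the assumed per-token KV size (~144 KB) reproduces both published ratios suggests the post's transfer_rate is simply KV bytes per token divided by measured bandwidth.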

Section 05

Performance Evaluation: Benchmarks on Qwen3-8B

Benchmarks were conducted on Qwen3-8B with RTX4090×8 and H20:

| Workload | Strategy | 4090 p50 | 4090 p99 | H20 p50 | H20 p99 |
|---|---|---|---|---|---|
| Short prompt | Collocated | 6.4s | 6.4s | 4.9s | 7.2s |
| Short prompt | Separated | 9.2s | 9.2s | 4.9s | 3.4s |
| Long prompt | Collocated | 7.2s | 7.3s | 6.1s | 10.2s |
| Long prompt | Separated | 7.3s | ~7s | 8.4s | 10.4s |

Observations:

  • H20's high P2P bandwidth (392 GB/s) makes KV transfer nearly free: on short prompts, the separated strategy matches collocated at p50.
  • RTX4090's lower bandwidth (12.9 GB/s) adds visible latency to the separated strategy.
  • The adaptive strategy reaches peak throughput of ~240 tok/s (4090) and ~175 tok/s (H20). The collocated strategy is competitive at low load, but its p99 latency degrades at high concurrency.

Section 06

Code Organization & Real-World Insights

Code Organization

nanoPD's code is modular with detailed bilingual docs:

  • block_manager: BlockSpaceManager (paged KV allocation, CoW).
  • engine: ModelRunner (custom paged_forward hook), Engine (scheduling loop, chunked prefill).
  • paged_attention: CUDA C++ extensions for paged attention.
  • workers: Collocated/Prefill/Decode Workers, KV transfer logic.
  • router: Router (cost model wrapper), OutputLengthPredictor (Bayesian).
  • cost_model: Profiler (device microbenchmarks), analytical model (curve fitting).
  • benchmark: Static batch, Poisson arrival tests, auto-scan, plotting.

Practical Implications

  1. Hardware-Aware Scheduling: Routing decisions should use actual hardware characteristics, not fixed rules.
  2. Bandwidth Criticality: Inter-GPU bandwidth determines separated architecture success.
  3. Adaptive Routing: No universal optimal strategy—adaptive is necessary.
  4. Full-Stack Engineering: From CUDA kernels to scheduling, nanoPD provides reusable foundations for research and production.

nanoPD is both a technical demo and a complete system, offering a full path from theory to practice for LLM inference optimization. Its modular design and docs make it an excellent learning resource.