# nanoPD: An LLM Inference Engine with Separated Prefill and Decode Stages, Implemented from Scratch

> A complete inference system with separated prefill and decode stages that solves resource competition issues in LLM inference through custom paged KV cache, CUDA kernels, multi-GPU KV transfer, and adaptive routing.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-11T06:12:04.000Z
- Last activity: 2026-04-11T06:17:28.307Z
- Popularity: 141.9
- Keywords: LLM inference, separated inference, Prefill-Decode, paged attention, CUDA kernels, multi-GPU, adaptive routing, KV cache
- Page link: https://www.zingnex.cn/en/forum/thread/nanopd-llm
- Canonical: https://www.zingnex.cn/forum/thread/nanopd-llm
- Markdown source: floors_fallback

---

## nanoPD: A Complete Prefill-Decode Separated LLM Inference Engine

nanoPD is a from-scratch LLM inference engine that runs Prefill and Decode as separate stages. It targets resource contention between the two stages with a custom paged KV cache, handwritten CUDA kernels, multi-GPU KV transfer, and adaptive routing. This thread covers its background, architecture, core techniques, cost model, benchmark results, and practical implications.

## Background: Bottlenecks in Traditional LLM Inference & Rise of Separated Architecture

LLM inference has two distinct stages: compute-bound Prefill (processing the input prompt) and memory-bandwidth-bound Decode (generating tokens one at a time). Traditional deployments run both stages on the same GPU, so they interfere with each other and leave resources underutilized. nanoPD implements a complete separated inference system, which also makes it a useful learning example for modern LLM serving architectures.

## Architecture Design & Core Technical Innovations

### Architecture Design
nanoPD's architecture has three layers:
1. **CentralScheduler**: Manages task distribution, KV transfer coordination, and path cost calculation.
2. **Worker Nodes**: Collocated Worker (handles both stages on one GPU), Prefill Worker (specialized for Prefill), Decode Worker (specialized for Decode).
3. **Router**: Chooses optimal execution paths using an analytical cost model.

### Core Innovations
- **Paged KV Cache**: Block-based memory management with Copy-on-Write for beam search/speculative decoding, improving memory efficiency.
- **Chunked Prefill**: Splits long prompts into configurable chunks, interleaving with Decode steps to keep GPU utilization high.
- **Custom CUDA Paged Attention Kernel**: Handwritten CUDA code for gather-scatter attention over non-contiguous KV blocks.
- **Async KV Transfer**: Uses dedicated CUDA streams for KV migration (relayed through pinned host memory, or direct P2P/NVLink) so that transfers overlap with computation.
- **Adaptive Router**: Makes decisions based on hardware-fitted cost models (no offline training) and an online Bayesian output length predictor.
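
To make the first bullet concrete, here is a minimal Python sketch of block-table bookkeeping with reference-counted Copy-on-Write. It is not nanoPD's `BlockSpaceManager`. The class name `ToyBlockManager`, its methods, and the block sizes are hypothetical and chosen only for illustration. It tracks block ids only; in a real engine the KV tensors and the block copy itself would live on the GPU.

```python
class ToyBlockManager:
    """Hypothetical sketch: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size          # tokens per block (illustrative)
        self.free = list(range(num_blocks))   # unused physical block ids
        self.refcount = {}                    # physical block -> number of owners
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored so far

    def _alloc(self):
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def append_token(self, seq_id):
        """Reserve a slot for one token.

        Returns (block, offset, copy) where copy is None or a (src, dst)
        pair telling the caller to copy block contents before writing.
        """
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        copy = None
        if n % self.block_size == 0:
            # No block yet, or the last block is full: take a fresh one.
            table.append(self._alloc())
        elif self.refcount[table[-1]] > 1:
            # Copy-on-Write: the partially filled last block is shared,
            # so this sequence gets a private copy before writing to it.
            old, new = table[-1], self._alloc()
            self.refcount[old] -= 1
            table[-1] = new
            copy = (old, new)
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size, copy

    def fork(self, parent, child):
        """Share every parent block with the child (e.g. a new beam), no copying."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.tables[child]:
            self.refcount[block] += 1

    def free_seq(self, seq_id):
        """Drop a sequence and return blocks nobody else references."""
        for block in self.tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)
        self.lengths.pop(seq_id)
```

Forking is cheap because only reference counts change. A copy happens only when a sequence writes into a block that another sequence still shares, which is what keeps beam search and speculative branches from duplicating their common prefix.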

## Cost Model: Mathematical Basis for Routing Decisions

The router uses hardware-measured parameters to estimate end-to-end latency for the collocated and separated strategies:

| Parameter | Meaning | RTX4090×8 | H20 |
|-----------|---------|-----------|-----|
| α | Prefill latency (ms/token) | 0.1247 | 0.1452 |
| β | Decode step latency (ms, batch size = 1) | 51.56 | 33.10 |
| batch_thresh | Batch size at which Decode crosses from memory-bound to compute-bound | 16 | 16 |
| γ | Prefill interference on Decode (ms/token) | 0.0869 | 0.1302 |
| bandwidth | Inter-GPU transfer bandwidth (GB/s) | 12.9 | 392 |

**Key Decision Logic**:
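- Separated strategy cost: transfer_rate × L, the time to move the KV cache for a prompt of L tokens (transfer_rate in ms/token).
- Collocated strategy cost: γ × L × (load / batch_thresh), the Decode slowdown caused by Prefill interference.
- Separated is better when γ / transfer_rate > batch_thresh / load, where load is the current system load.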
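
The last rule follows directly from comparing the two costs. Writing $\tau$ for transfer_rate, $B$ for batch_thresh, and $\rho$ for system load (shorthand introduced here, not the project's notation):

$$
\begin{aligned}
C_{\text{sep}} &= \tau L, \qquad C_{\text{col}} = \gamma L \,\frac{\rho}{B} \\
C_{\text{sep}} < C_{\text{col}}
  &\iff \tau < \gamma \,\frac{\rho}{B}
  \iff \frac{\gamma}{\tau} > \frac{B}{\rho}
\end{aligned}
$$

$L$ cancels, so with these two cost terms the choice depends only on the hardware ratio $\gamma/\tau$ and the current load, not on prompt length.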

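Examples (batch_thresh = 16 on both platforms):
- RTX4090: γ/transfer_rate ≈ 7.6, so the break-even load is 16 / 7.6 ≈ 2.1 and separated is better once system load ≥ 3.
- H20: γ/transfer_rate ≈ 346, so the break-even load is 16 / 346 ≈ 0.05 and separated is better at almost any non-zero load.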
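
The two ratios can be reproduced from the parameter table with a short Python sketch. The thread does not say how transfer_rate is obtained, so this assumes it is one token's KV-cache size divided by link bandwidth, with a Qwen3-8B layout of 36 layers, 8 KV heads, head dimension 128, and 2-byte elements, and 1 GB taken as 10^9 bytes. The function names are hypothetical. Under these assumptions the ratios come out at roughly 7.6 and 346.

```python
import math

# ASSUMPTION (not stated in the thread): Qwen3-8B KV layout of 36 layers,
# 8 KV heads, head dim 128, 2-byte elements. The leading 2 covers K and V.
KV_BYTES_PER_TOKEN = 2 * 36 * 8 * 128 * 2

# (gamma in ms/token, bandwidth in GB/s, batch_thresh), from the parameter table.
HARDWARE = {
    "RTX4090x8": (0.0869, 12.9, 16),
    "H20": (0.1302, 392.0, 16),
}


def transfer_rate_ms_per_token(bandwidth_gb_s):
    """Time to move one token's KV cache over the link, in ms (1 GB = 1e9 bytes)."""
    return KV_BYTES_PER_TOKEN / (bandwidth_gb_s * 1e9) * 1e3


def break_even_load(gamma, bandwidth_gb_s, batch_thresh):
    """Load at which gamma / transfer_rate equals batch_thresh / load."""
    return batch_thresh * transfer_rate_ms_per_token(bandwidth_gb_s) / gamma


def prefer_separated(gamma, bandwidth_gb_s, batch_thresh, load):
    """True when the separated path is cheaper under the decision rule."""
    return load > break_even_load(gamma, bandwidth_gb_s, batch_thresh)


if __name__ == "__main__":
    for name, (gamma, bw, thresh) in HARDWARE.items():
        ratio = gamma / transfer_rate_ms_per_token(bw)
        load = break_even_load(gamma, bw, thresh)
        print(f"{name}: gamma/transfer_rate = {ratio:.1f}, "
              f"break-even load = {load:.2f}, "
              f"first integer load favoring separated = {math.floor(load) + 1}")
```

Because the break-even load scales inversely with bandwidth, the roughly 30× bandwidth gap between the two platforms accounts for most of the difference in their routing behavior.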

## Performance Evaluation: Benchmarks on Qwen3-8B

Benchmarks were conducted on Qwen3-8B with RTX4090×8 and H20:

| Workload | Strategy | RTX4090 p50 | RTX4090 p99 | H20 p50 | H20 p99 |
|----------|----------|-------------|-------------|---------|---------|
| Short prompt | Collocated | 6.4s | 6.4s | 4.9s | 7.2s |
| Short prompt | Separated | 9.2s | 9.2s | 4.9s | 3.4s |
| Long prompt | Collocated | 7.2s | 7.3s | 6.1s | 10.2s |
| Long prompt | Separated | 7.3s | ~7s | 8.4s | 10.4s |

**Observations**:
- H20's high P2P bandwidth (392 GB/s) makes KV transfer almost free: short-prompt p50 is 4.9s for both the separated and collocated strategies.
- RTX4090's lower bandwidth (12.9 GB/s) adds a visible delay to the separated strategy: short-prompt p50 rises from 6.4s to 9.2s.
- The adaptive strategy reaches a peak throughput of about 240 tok/s (RTX4090) and about 175 tok/s (H20). The collocated strategy is competitive at low load, but its p99 latency degrades at high concurrency.

## Code Organization & Real-World Insights

### Code Organization
nanoPD's code is modular with detailed bilingual docs:
- block_manager: BlockSpaceManager (paged KV allocation, CoW).
- engine: ModelRunner (custom paged_forward hook), Engine (scheduling loop, chunked prefill).
- paged_attention: CUDA C++ extensions for paged attention.
- workers: Collocated/Prefill/Decode Workers, KV transfer logic.
- router: Router (cost model wrapper), OutputLengthPredictor (Bayesian).
- cost_model: Profiler (device microbenchmarks), analytical model (curve fitting).
- benchmark: Static batch, Poisson arrival tests, auto-scan, plotting.
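
For readers unfamiliar with Poisson-arrival benchmarking, the Python sketch below shows the general idea: requests arrive with exponentially distributed gaps, and p50/p99 are read off the resulting per-request latencies. It does not reproduce nanoPD's benchmark module. The function names are hypothetical, and the single FIFO server with a fixed service time is only a toy stand-in for a real engine.

```python
import random
import statistics


def poisson_arrival_times(rate_per_s, n, seed=0):
    """Timestamps of a Poisson process: i.i.d. exponential gaps with mean 1/rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate_per_s)
        times.append(t)
    return times


def fifo_latencies(arrivals, service_s):
    """Per-request latency on one FIFO server with a fixed service time (toy model)."""
    finish, latencies = 0.0, []
    for arrival in arrivals:
        start = max(arrival, finish)   # wait if the server is still busy
        finish = start + service_s
        latencies.append(finish - arrival)
    return latencies


def p50_p99(latencies):
    """Median and 99th percentile via the standard library's percentile cut points."""
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return cuts[49], cuts[98]


if __name__ == "__main__":
    for rate in (2.0, 8.0):            # requests per second
        arrivals = poisson_arrival_times(rate, n=5000)
        p50, p99 = p50_p99(fifo_latencies(arrivals, service_s=0.1))
        print(f"rate={rate}/s  p50={p50:.3f}s  p99={p99:.3f}s")
```

Even in this toy model, bursts in the arrival stream queue up behind each other, which is why tail latency (p99) is usually reported alongside the median.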

### Practical Implications
1. **Hardware-Aware Scheduling**: Routing decisions should use actual hardware characteristics, not fixed rules.
2. **Bandwidth Criticality**: Inter-GPU bandwidth largely determines whether a separated architecture pays off.
3. **Adaptive Routing**: No single strategy is optimal everywhere, so routing has to adapt to hardware and load.
4. **Full-Stack Engineering**: From CUDA kernels to scheduling, nanoPD provides reusable foundations for research and production.

nanoPD is both a technical demo and a complete system, offering a full path from theory to practice for LLM inference optimization. Its modular design and docs make it an excellent learning resource.
