Section 01
nanoPD: A Complete Prefill-Decode Separated LLM Inference Engine
nanoPD is a Prefill-Decode separated LLM inference engine implemented from scratch. It tackles the resource contention between the compute-bound prefill phase and the memory-bound decode phase through a custom paged KV cache, custom CUDA kernels, multi-GPU KV-cache transfer, and adaptive request routing. This thread breaks down its background, architecture, core techniques, cost model, performance benchmarks, and practical implications.
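Before diving in, here is a minimal sketch of the core idea behind prefill-decode separation. This is illustrative only, not nanoPD's actual API: all names are hypothetical, a toy list stands in for the per-layer KV tensors, and a queue stands in for the cross-GPU transfer link.

```python
from queue import Queue

def prefill(prompt_tokens):
    """Compute-bound phase: process the whole prompt in one batch and
    build the KV cache (here, a toy list standing in for per-layer
    key/value tensors)."""
    return [("kv", t) for t in prompt_tokens]

def decode(kv_cache, max_new_tokens):
    """Memory-bound phase: generate one token per step, reading the
    whole KV cache each step and appending the new entry to it."""
    out = []
    for _ in range(max_new_tokens):
        next_tok = len(kv_cache)  # toy "model": next token = cache length
        kv_cache.append(("kv", next_tok))
        out.append(next_tok)
    return out

# In a disaggregated engine the two phases run on different GPUs and the
# KV cache is shipped between them; a queue models that transfer here.
link = Queue()
link.put(prefill([101, 102, 103]))  # prefill worker side
tokens = decode(link.get(), 4)      # decode worker side
print(tokens)                       # -> [3, 4, 5, 6]
```

Because the two phases have such different compute profiles, running them on separate devices avoids the interference you get when a long prefill stalls ongoing decode steps, at the cost of moving the KV cache between GPUs.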