Section 01
nanoPD: A Complete Prefill-Decode Separated LLM Inference Engine
nanoPD is a Prefill-Decode separated LLM inference engine implemented from scratch. It tackles the resource contention between the compute-bound prefill phase and the memory-bound decode phase through a custom paged KV cache, custom CUDA kernels, multi-GPU KV-cache transfer, and adaptive request routing. This thread breaks down its background, architecture, core techniques, cost model, performance benchmarks, and practical implications.
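Before diving in, here is a minimal sketch of the core idea behind prefill-decode separation. This is illustrative only, not nanoPD's actual API: all names are hypothetical, a toy list stands in for the per-layer KV tensors, and a queue stands in for the cross-GPU transfer link.

```python
from queue import Queue

def prefill(prompt_tokens):
    """Compute-bound phase: process the whole prompt in one batch and
    build the KV cache (here, a toy list standing in for per-layer
    key/value tensors)."""
    return [("kv", t) for t in prompt_tokens]

def decode(kv_cache, max_new_tokens):
    """Memory-bound phase: generate one token per step, reading the
    whole KV cache each step and appending the new entry to it."""
    out = []
    for _ in range(max_new_tokens):
        next_tok = len(kv_cache)  # toy "model": next token = cache length
        kv_cache.append(("kv", next_tok))
        out.append(next_tok)
    return out

# In a disaggregated engine the two phases run on different GPUs and the
# KV cache is shipped between them; a queue models that transfer here.
link = Queue()
link.put(prefill([101, 102, 103]))  # prefill worker side
tokens = decode(link.get(), 4)      # decode worker side
print(tokens)                       # -> [3, 4, 5, 6]
```

Because the two phases have such different compute profiles, running them on separate devices avoids the interference you get when a long prefill stalls ongoing decode steps, at the cost of moving the KV cache between GPUs.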