Zing Forum


Implementing LLM Inference Optimization Techniques from Scratch: KV Cache, Paged Attention, and PD Disaggregation

This article deeply analyzes the core technologies for accelerating large language model (LLM) inference, including KV Cache, Paged Attention, and Prefill/Decode Disaggregation (PD Disaggregation), and provides a step-by-step implementation guide from scratch.

Tags: LLM inference optimization, KV Cache, Paged Attention, PD Disaggregation, ORCA scheduling, vLLM
Published 2026-04-19 19:03 · Last activity 2026-04-19 19:20 · Estimated read: 7 min

Section 01

[Introduction] Implementing Core LLM Inference Optimization Technologies from Scratch: KV Cache, Paged Attention, and PD Disaggregation

This article deeply analyzes the core technologies for accelerating large language model (LLM) inference, including KV Cache, Paged Attention, and Prefill/Decode Disaggregation (PD Disaggregation), and provides an implementation guide from scratch. It also covers auxiliary optimization techniques such as ORCA iteration-level scheduling and ZeroMQ zero-copy communication, as well as key considerations for production environments like hardware configuration and model feature adaptation, helping developers understand and build efficient LLM inference services.


Section 02

Challenges in Inference Performance and Phase Division

LLM inference speed directly affects user experience and system cost. As model scale grows, latency and throughput become deployment bottlenecks. The inference process divides into two phases: prefill (processing the input prompt to produce the first token) and decode (autoregressively generating subsequent tokens). The two phases have very different computational profiles and call for different optimizations: prefill is compute-bound, dominated by large matrix multiplications, while decode is memory-bandwidth-bound, dominated by vector operations.
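To make the two-phase shape concrete, here is a toy sketch (no real model; `toy_attend` and all numbers are illustrative stand-ins) showing that prefill ingests the whole prompt to build up cached state, while decode produces one token per step while re-reading that state:

```python
# Toy sketch of the two-phase shape of autoregressive generation.
# "toy_attend" stands in for attention: one query against everything
# cached so far. All values and functions here are illustrative.

def toy_attend(query, cache):
    # one query position attends over all cached positions
    return sum(cache) + query

def generate(prompt, n_new):
    # --- prefill: ingest the whole prompt, building the cache ---
    # (in a real model this is one batched matrix multiply over all
    # prompt positions, which is why prefill is compute-bound)
    cache = list(prompt)
    out = [toy_attend(prompt[-1], cache)]
    # --- decode: one token per step, re-reading the whole cache ---
    # (each step touches all cached state, which is why decode is
    # memory-bandwidth-bound)
    for _ in range(n_new - 1):
        tok = out[-1] % 7              # stand-in for sampling
        cache.append(tok)
        out.append(toy_attend(tok, cache))
    return out

print(generate([1, 2, 3], 3))  # [9, 10, 14]
```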


Section 03

KV Cache: The Key to Reducing Redundant Computation

In the Transformer self-attention mechanism, the Key/Value (KV) tensors of already generated tokens can be cached and reused to avoid redundant computation. Working principle: after generating the first token, save each layer's KV tensors; for subsequent tokens, compute only the new token's Query vector and attend against the cached KV. Performance improvement: in a CUDA environment, inference speed rises from 39.11 tokens/s to 42.68 tokens/s; in an MPS environment, it jumps from 12.68 tokens/s to 33.73 tokens/s. The gain is especially critical for long-sequence generation.
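The saving can be seen with simple bookkeeping, no model required. This sketch (function names are mine, for illustration) counts how many K/V projections must be computed when generating new tokens with and without a cache, showing the drop from quadratic to linear work:

```python
# Counting sketch: how many key/value projections are computed when
# generating n_new tokens after a prompt, with and without a KV cache.
# Pure bookkeeping; the point is the O(n^2) -> O(n) reduction.

def kv_projections_without_cache(prompt_len, n_new):
    # every decode step re-runs the model over the whole sequence so far,
    # recomputing K/V for every position
    ops, seq = 0, prompt_len
    for _ in range(n_new):
        ops += seq
        seq += 1
    return ops

def kv_projections_with_cache(prompt_len, n_new):
    # prefill computes K/V for the prompt once; each decode step
    # computes K/V only for the single new token
    return prompt_len + n_new

print(kv_projections_without_cache(100, 50))  # 6225
print(kv_projections_with_cache(100, 50))     # 150
```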


Section 04

Paged Attention: Innovation in Efficient Memory Management

A traditional KV cache pre-allocates a fixed, contiguous memory region per sequence, which wastes capacity. Paged Attention borrows the idea of virtual memory: the KV cache is split into fixed-size blocks (e.g., 16 or 32 tokens) that are allocated on demand. Core mechanisms: a block table records the mapping from logical blocks to physical blocks; blocks can be shared and copied; and physical blocks need not be stored contiguously, which eliminates fragmentation. Practical benefits: higher GPU memory utilization and more concurrent requests; the technique has been widely adopted by production engines such as vLLM.
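The block-table idea can be sketched in a few lines. The class names (`BlockAllocator`, `Sequence`) and the block size are illustrative, not vLLM's actual API; the point is that each sequence holds only a logical-to-physical mapping and claims a new, arbitrary physical block only when its last block fills up:

```python
# Minimal block-table sketch of Paged Attention's memory model:
# the KV cache is split into fixed-size blocks, and each sequence keeps
# a block table mapping logical block index -> physical block id.

BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        return self.free.pop()      # physical blocks need not be contiguous

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []       # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # allocate a new physical block only when the last one is full
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_physical_blocks=8)
seq = Sequence(alloc)
for _ in range(40):                 # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(len(seq.block_table), len(alloc.free))  # 3 5
```

Memory is thus wasted only inside the final, partially filled block of each sequence, instead of across a whole over-provisioned contiguous region.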


Section 05

PD Disaggregation: Optimization Strategy for Heterogeneous Computing

The prefill (compute-intensive) and decode (memory-intensive) phases differ markedly in their resource profiles, so PD Disaggregation assigns them to different hardware. Architecture: prefill nodes use high-compute GPUs to process inputs in parallel, while decode nodes are optimized for memory bandwidth to generate tokens quickly; the two sides hand off KV state over an efficient communication channel. Performance data: throughput reaches 43.99 tokens/s in a simulated environment, and prefill takes only 0.######## seconds.
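The handoff can be sketched as two workers connected by a channel. Everything here is illustrative (the fake KV values, the worker functions, the in-process queue standing in for the network transport a real system would use, e.g. RDMA or ZMQ):

```python
# Sketch of PD disaggregation: a prefill worker builds KV state for each
# request and ships it through a channel; a decode worker receives the
# state and continues generation from it. All names are illustrative.

from queue import Queue

def prefill_worker(requests, kv_channel):
    for req_id, prompt in requests:
        # fake per-token KV "tensors"; a real system ships real tensors
        kv_state = [hash((req_id, t)) % 1000 for t in prompt]
        kv_channel.put((req_id, kv_state, prompt[-1]))

def decode_worker(kv_channel, n_new, results):
    while not kv_channel.empty():
        req_id, kv_state, last_tok = kv_channel.get()
        out = []
        for _ in range(n_new):      # autoregressive decode from handed-off KV
            last_tok = (last_tok + len(kv_state)) % 50
            kv_state.append(last_tok)   # decode extends the cache locally
            out.append(last_tok)
        results[req_id] = out

kv_channel, results = Queue(), {}
prefill_worker([("a", [1, 2, 3]), ("b", [4, 5])], kv_channel)
decode_worker(kv_channel, n_new=4, results=results)
print(sorted(results))  # ['a', 'b']
```

The design point is that after the handoff, decode never needs the prompt again, only the KV state, so the two phases can run on machines tuned for their respective bottlenecks.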


Section 06

Auxiliary Optimization Technologies: ORCA Scheduling and Zero-Copy Communication

The ORCA engine uses iteration-level scheduling: it re-selects the batch of requests at every generation iteration, supporting mixed-length sequences, dynamically adjusting batch size, and minimizing pipeline bubbles. ZeroMQ (ZMQ) handles request distribution between the front end and the engine, KV synchronization across GPUs, and streaming of results back to clients; its publish-subscribe and request-reply patterns fit the communication needs of an LLM service.
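Iteration-level scheduling can be sketched as a loop that re-forms the batch every step, so short requests exit immediately and waiting ones fill the freed slots. The request lengths and the batch limit below are illustrative:

```python
# Sketch of ORCA-style iteration-level scheduling: instead of batching
# whole requests, the engine re-forms the batch at every generation step.
# Finished sequences leave immediately; waiting ones join mid-flight.

from collections import deque

def iteration_level_schedule(requests, max_batch=2):
    waiting = deque(requests)       # each request: (req_id, tokens_to_generate)
    running, finished, steps = [], [], 0
    while waiting or running:
        # admit waiting requests into free batch slots -- every iteration
        while waiting and len(running) < max_batch:
            running.append(list(waiting.popleft()))
        steps += 1
        for seq in running:
            seq[1] -= 1             # one token generated per sequence this step
        finished += [s[0] for s in running if s[1] == 0]
        running = [s for s in running if s[1] > 0]
    return finished, steps

done, steps = iteration_level_schedule([("a", 3), ("b", 1), ("c", 2)])
print(done, steps)  # ['b', 'a', 'c'] 3
```

Note that "b" finishes after one step and "c" takes its slot immediately; with request-level batching, "c" could not start until the whole first batch drained, which is exactly the pipeline bubble this scheme avoids.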


Section 07

Implementation Key Points and Learning Path

Recommended learning path for developers:
1. Basic implementation: understand the autoregressive generation loop, starting from greedy sampling.
2. Add a KV cache: observe the performance improvement.
3. Paged Attention: implement block-level memory management and understand virtual-memory-style mapping.
4. Advanced scheduling: explore ORCA iteration-level scheduling and the PD disaggregation architecture.
Each phase should be accompanied by performance benchmarks to quantify its effect.
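For the benchmarking step, a minimal tokens/s harness of the kind that produces numbers like those quoted above might look as follows; `generate_fn` is any generation callable you are comparing (with vs. without KV cache, etc.), and the toy stand-in at the bottom exists only to make the sketch runnable:

```python
# Simple tokens/s benchmarking harness: warm up, then time several runs
# and report the best, to reduce noise from caches and background load.

import time

def benchmark(generate_fn, n_tokens, warmup=1, repeats=3):
    for _ in range(warmup):
        generate_fn(n_tokens)           # warm caches / lazy init before timing
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        generate_fn(n_tokens)
        best = min(best, time.perf_counter() - t0)
    return n_tokens / best              # tokens per second, best of repeats

# toy stand-in generator so the harness runs end to end
tps = benchmark(lambda n: [i * i for i in range(n * 1000)], n_tokens=64)
print(f"{tps:.1f} tokens/s")
```

Using the best of several repeats (rather than the mean) is a common choice for microbenchmarks, since external interference only ever slows a run down.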


Section 08

Key Considerations for Production Environments

Production deployment must account for: hardware configuration (PD disaggregation requires a dedicated interconnect topology; Paged Attention requires sufficient memory bandwidth); model characteristics (KV cache layouts differ across architectures such as Llama, GPT, and Mistral and need targeted adjustment); and service-level objectives (real-time dialogue prioritizes low latency, while batch workloads pursue throughput).