Modern Machine Learning Systems Study Notes: From PagedAttention to LLM Inference Optimization

An in-depth look at an open-source repository of ML systems study notes, covering the principles and implementation details of PagedAttention, vLLM multi-GPU parallelism, diffusion model acceleration, ORCA scheduling, and related techniques.

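Tags: machine learning systems, LLM inference, PagedAttention, vLLM, tensor parallelism, diffusion models, ORCA, Sarathi, inference optimization, memory management
Published 2026-05-21 11:45 · Recent activity 2026-05-21 11:56 · Estimated read 7 min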

Section 01

Guide to Modern Machine Learning Systems Study Notes

With the rapid development of large language models (LLMs), machine learning has grown from algorithm research into complex systems engineering. Systems concerns such as inference efficiency, deployment architecture, and memory management now determine whether an AI product works in practice. The open-source study notes repository introduced in this article builds up a body of knowledge from low-level optimization to high-level architecture through paper reading, source code analysis, and experiments. It covers PagedAttention, vLLM multi-GPU parallelism, diffusion model acceleration, ORCA scheduling, and Sarathi-Serve, and serves as a useful reference for ML systems engineers and researchers.

Section 02

Background and Challenges of ML Systems Engineering

ML systems face four core challenges:

  1. KV cache pre-allocation wastes and fragments memory: a long-context model must reserve a large contiguous region per request even when the actual generation turns out to be short;
  2. Traditional request-level batching causes severe tail latency, because sequences of different lengths are held in the same batch until the longest one finishes;
  3. The Prefill and Decode stages have different computational characteristics, which leads to unbalanced resource utilization;
  4. The parameters of ultra-large models exceed single-GPU memory capacity, so the model must be split across multiple GPUs.
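
The first challenge can be made concrete with a little arithmetic. The sketch below compares how much of the reserved KV cache a single request actually uses under contiguous pre-allocation versus block-granular allocation. The model shape, request length, and block size are hypothetical values chosen for illustration, not figures from the notes.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Each layer stores one key vector and one value vector per KV head per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def reserved_tokens_contiguous(max_len):
    # Pre-allocation reserves room for the maximum sequence length up front.
    return max_len


def reserved_tokens_paged(actual_len, block_size):
    # Block-granular allocation rounds up to a whole number of blocks.
    num_blocks = -(-actual_len // block_size)  # ceiling division
    return num_blocks * block_size


# Hypothetical model shape and request, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
actual_len, max_len, block_size = 300, 2048, 16

used = actual_len * per_token
contiguous = reserved_tokens_contiguous(max_len) * per_token
paged = reserved_tokens_paged(actual_len, block_size) * per_token

print(f"per-token KV size: {per_token / 1024:.0f} KiB")
print(f"contiguous utilization: {used / contiguous:.1%}")
print(f"paged utilization: {used / paged:.1%}")
```

Under these assumed numbers each token costs 512 KiB of KV cache, and a 300-token request uses about 14.6% of a 2048-token reservation but about 98.7% of the 19 blocks a paged allocator would hand out.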

Section 03

Analysis of Core Technical Methods

PagedAttention

PagedAttention brings the idea of virtual memory paging to LLM inference: the KV cache is divided into fixed-size pages that are allocated on demand and need not be contiguous in memory. Pages can be shared between sequences with copy-on-write, and freed pages return to a common pool for reuse.
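
The bookkeeping behind these ideas can be sketched in a few lines. The code below is a toy model, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, blocks are plain integers, and no tensor data is stored. It shows on-demand allocation, a per-sequence block table, sharing via reference counts, and copy-on-write when a shared block is about to be written.

```python
class BlockAllocator:
    """Hypothetical pool of fixed-size KV cache blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)  # block goes back to the pool


class Sequence:
    def __init__(self, allocator, block_size):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            # No block yet, or the last one is full: allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: take a private block before writing.
                # A real system would also copy the block's contents here.
                private = self.allocator.allocate()
                self.allocator.release(last)
                self.block_table[-1] = private
        self.num_tokens += 1

    def fork(self):
        # The child shares every block of the parent until one of them writes.
        child = Sequence(self.allocator, self.block_size)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


alloc = BlockAllocator(num_blocks=8)
parent = Sequence(alloc, block_size=4)
for _ in range(6):
    parent.append_token()   # 6 tokens occupy 2 blocks
child = parent.fork()       # both blocks are now shared
child.append_token()        # the partly filled last block is copied first
```

After the last call, the two sequences still share their first block but hold different last blocks, which is the behavior that makes prefix sharing cheap.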

vLLM Multi-GPU Parallelism

  • Tensor parallelism: Split attention heads and FFN layers, aggregate results via All-Reduce;
  • Pipeline parallelism: Split the model by layers, use micro-batch pipelining and interleaved scheduling to hide latency.
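
To illustrate the tensor-parallel bullet, here is a minimal pure-Python sketch of a two-layer FFN split across ranks. It assumes the column-then-row split popularized by Megatron-LM; the helper names are hypothetical, the "ranks" are just loop iterations, and `all_reduce_sum` stands in for a real collective.

```python
def matmul(a, b):
    # Plain-Python matrix product of a (m x k) and b (k x n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(w, parts):
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def all_reduce_sum(partials):
    # Stand-in for All-Reduce: elementwise sum of every rank's partial output.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def ffn_reference(x, w1, w2):
    return matmul(relu(matmul(x, w1)), w2)


def ffn_tensor_parallel(x, w1, w2, world_size):
    w1_shards = split_columns(w1, world_size)  # each rank owns some hidden units
    w2_shards = split_rows(w2, world_size)     # and the matching rows of w2
    partials = [
        matmul(relu(matmul(x, w1_r)), w2_r)    # independent work per rank
        for w1_r, w2_r in zip(w1_shards, w2_shards)
    ]
    return all_reduce_sum(partials)


x = [[1.0, -2.0]]
w1 = [[1.0, -1.0, 2.0, 0.5], [0.5, 1.0, -1.0, 2.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
print(ffn_tensor_parallel(x, w1, w2, world_size=2))
print(ffn_reference(x, w1, w2))
```

Splitting the first matrix by columns keeps the elementwise activation local to each rank, so only one All-Reduce is needed at the end of the block.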

Diffusion Model Acceleration

  • Activation caching: Cache outputs of layers with small changes between adjacent iterations;
  • Step reduction: samplers such as DDIM (for example, 1000→50 steps) and DPM-Solver, as well as consistency models.
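
Both ideas can be sketched without a real diffusion model. In the code below, `subsample_timesteps` picks an evenly spaced subsequence of the training timesteps, and `CachedLayer` reuses an expensive layer's output between refreshes. The fixed refresh interval is an assumption made for simplicity; the notes describe caching layers whose outputs change little, which a real system would have to measure. All names are hypothetical.

```python
def subsample_timesteps(num_train_steps, num_inference_steps):
    # Evenly spaced subsequence of the training timesteps, visited from noisy to clean.
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]


class CachedLayer:
    """Hypothetical wrapper that recomputes an expensive layer only every few steps."""

    def __init__(self, fn, refresh_every):
        self.fn = fn
        self.refresh_every = refresh_every
        self.cached = None
        self.calls = 0  # how many times fn actually ran

    def __call__(self, x, step_index):
        if self.cached is None or step_index % self.refresh_every == 0:
            self.cached = self.fn(x)
            self.calls += 1
        return self.cached


timesteps = subsample_timesteps(num_train_steps=1000, num_inference_steps=50)
layer = CachedLayer(fn=lambda x: 2.0 * x, refresh_every=5)  # fn is a placeholder
for step_index, t in enumerate(timesteps):
    layer(float(t), step_index)

print(len(timesteps), "sampling steps,", layer.calls, "real layer evaluations")
```

The two savings multiply: fewer sampling steps, and fewer full evaluations of the cached layer within those steps.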

ORCA Scheduling

ORCA combines iteration-level scheduling, which re-forms the batch after every generation iteration, with selective batching to improve GPU utilization.
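
A small simulation shows why the scheduling granularity matters. The sketch below is a simplified model, not ORCA's scheduler: every request emits one token per iteration, prefill cost is ignored, selective batching is not modeled, and the function names and the request list are made up for the example.

```python
from collections import deque


def run_request_level(requests, max_batch_size):
    """requests: list of (request_id, arrival_iteration, tokens_to_generate)."""
    pending = deque(sorted(requests, key=lambda r: r[1]))
    finished = {}
    iteration = 0
    while pending:
        iteration = max(iteration, pending[0][1])
        batch = []
        while pending and pending[0][1] <= iteration and len(batch) < max_batch_size:
            batch.append(pending.popleft())
        # The whole batch is held until its longest member is done.
        iteration += max(n for _, _, n in batch)
        for rid, _, _ in batch:
            finished[rid] = iteration
    return finished


def run_iteration_level(requests, max_batch_size):
    pending = deque(sorted(requests, key=lambda r: r[1]))
    running = {}   # request_id -> tokens still to generate
    finished = {}
    iteration = 0
    while pending or running:
        # Admit arrived requests whenever a slot is free.
        while pending and pending[0][1] <= iteration and len(running) < max_batch_size:
            rid, _, n = pending.popleft()
            running[rid] = n
        # One generation iteration: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished[rid] = iteration + 1
        iteration += 1
    return finished


requests = [("a", 0, 2), ("b", 0, 10), ("c", 1, 2)]
request_level = run_request_level(requests, max_batch_size=2)
iteration_level = run_iteration_level(requests, max_batch_size=2)
print("request-level:  ", request_level)
print("iteration-level:", iteration_level)
```

In this toy workload the short requests `a` and `c` complete at iterations 10 and 12 under request-level batching, but at iterations 2 and 4 when the batch is re-formed every iteration; the long request `b` finishes at iteration 10 either way.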

Sarathi-Serve

Chunked prefill: a long prompt is split into multiple chunks, and these chunks are executed interleaved with Decode requests.
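
One way to picture this is as a per-iteration token budget that decode requests and a prefill chunk share. The sketch below assumes a fixed number of ongoing decode requests and a single long prompt; `token_budget` and the function names are hypothetical and do not come from Sarathi-Serve's code.

```python
def split_into_chunks(prompt_len, chunk_size):
    # Token ranges [start, end) that cover the prompt in order.
    return [(s, min(s + chunk_size, prompt_len)) for s in range(0, prompt_len, chunk_size)]


def build_batches(prompt_len, num_decodes, token_budget):
    """Plan iterations that mix one prefill chunk with ongoing decode requests.

    Each decode request contributes one token per iteration; the prefill chunk
    takes whatever remains of the per-iteration token budget.
    """
    chunk_size = token_budget - num_decodes
    if chunk_size <= 0:
        raise ValueError("token budget must exceed the number of decode requests")
    return [
        {"prefill_range": rng, "decode_tokens": num_decodes}
        for rng in split_into_chunks(prompt_len, chunk_size)
    ]


batches = build_batches(prompt_len=1000, num_decodes=12, token_budget=256)
for b in batches:
    print(b)
```

With these assumed numbers the 1000-token prompt is spread over five iterations, and the 12 decode requests keep producing a token in each of them instead of stalling behind one long prefill.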

Section 04

Technical Effects and Evidence

  • PagedAttention: KV cache memory utilization rises from 20-40% to over 80%, which allows larger batches, higher throughput, and lower tail latency;
  • vLLM multi-GPU parallelism: Makes it possible to split ultra-large models across GPUs and scale them out;
  • Diffusion model acceleration: Caching and step reduction significantly shorten generation time;
  • ORCA: Addresses the tail latency problem of traditional batching, since new requests can join at the next iteration;
  • Sarathi-Serve: Balances resource utilization between Prefill and Decode, so that long prompts do not block short requests.

Section 05

Optimization Principles and Future Outlook

Core Optimization Principles

  1. Memory is the bottleneck: Optimization revolves around reducing memory access;
  2. Batching is key: Intelligent batching fully utilizes GPU parallel capabilities;
  3. Latency vs. throughput trade-off: Different goals for different scenarios;
  4. Hardware-software co-design: Design software around hardware features such as Tensor Cores and HBM.
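
The first two principles are linked, and a rough calculation shows how. The sketch below estimates the arithmetic intensity (FLOPs per byte moved) of one linear layer as the number of tokens in the batch grows. It is a back-of-the-envelope model that assumes 16-bit weights and activations and counts each tensor as read or written once; the layer size is hypothetical.

```python
def linear_layer_intensity(batch_tokens, d_in, d_out, dtype_bytes=2):
    # One multiply and one add per weight, per token.
    flops = 2 * batch_tokens * d_in * d_out
    # Weights are read once per batch; activations are read and written once.
    weight_bytes = d_in * d_out * dtype_bytes
    activation_bytes = batch_tokens * (d_in + d_out) * dtype_bytes
    return flops / (weight_bytes + activation_bytes)


for batch_tokens in (1, 8, 64):
    intensity = linear_layer_intensity(batch_tokens, d_in=4096, d_out=4096)
    print(f"batch of {batch_tokens:>2} tokens: {intensity:.1f} FLOPs per byte")
```

With a single token the layer does roughly one FLOP per byte of traffic, so it is limited by memory bandwidth rather than compute. Batching more tokens reuses the same weights and raises the intensity almost linearly, which is why batching decisions dominate decode throughput.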

Future Directions

Speculative decoding, quantization and compression (INT8/INT4), multimodal inference, and lightweight models for edge deployment.

Section 06

Learning Path and Practical Recommendations

Learning Sequence

  1. Basics: Transformer architecture and attention mechanism;
  2. Optimization: PagedAttention memory management;
  3. Parallelism: Implementation of tensor parallelism and pipeline parallelism;
  4. Scheduling: ORCA and Sarathi strategies;
  5. Systems: Design a complete inference service architecture.

Hands-on Practice

  • Reproduce performance tests to build intuition;
  • Modify parameters (page size, chunk size) to observe impacts;
  • Validate theories on actual models;
  • Contribute improvements to open-source communities.