Zing Forum

Ragged Paged Attention: A High-Performance LLM Inference Kernel Built for TPUs

The Google Research team has released the Ragged Paged Attention (RPA) kernel, which achieves 86% memory bandwidth utilization and 73% model FLOPs utilization on TPUs through three key techniques: fine-grained chunking, software pipeline fusion, and distribution-aware compilation, providing a production-grade solution for LLM inference.

Tags: TPU · LLM inference · attention mechanisms · kernel optimization · vLLM · SGLang · PagedAttention · LLM deployment
Published 2026-04-17 02:30 · Recent activity 2026-04-20 09:49 · Estimated read: 5 min

Section 01

[Introduction] Ragged Paged Attention: A High-Performance LLM Inference Kernel Tailored for TPUs

The Google Research team has introduced the Ragged Paged Attention (RPA) kernel, specifically designed for TPUs. Through three key technologies—fine-grained chunking, software pipeline fusion, and distribution-aware compilation—it achieves 86% memory bandwidth utilization and 73% model FLOPs utilization. It has been integrated into the vLLM and SGLang frameworks, providing a production-grade solution for LLM inference and enhancing the cost-effectiveness and ecosystem maturity of TPUs in inference scenarios.


Section 02

Background: Opportunities and Challenges of TPU Inference, and the Dilemma of Ragged Execution

TPUs have become a preferred choice for enterprise LLM deployment thanks to their energy efficiency and total cost of ownership (TCO). However, most existing inference solutions are designed for GPUs, and efficient TPU-native solutions are scarce. Modern LLM serving must handle requests of widely varying lengths (ragged execution), which raises three major challenges:

  1. Memory fragmentation: the KV cache is difficult to manage;
  2. Imbalanced compute load: padding causes wasted computation;
  3. Complex scheduling: resources must be balanced between the prefill and decode phases.
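The padding waste in point 2 is easy to quantify: in a padded batch, every request is processed at the length of the longest one. A toy Python example (the request lengths are hypothetical, chosen only to illustrate the effect):

```python
# Illustrative only: how much compute padding wastes in a padded batch.
seq_lens = [12, 87, 3, 160]                    # ragged request lengths
padded_tokens = max(seq_lens) * len(seq_lens)  # 160 * 4 = 640 slots
useful_tokens = sum(seq_lens)                  # 262 real tokens
waste = 1 - useful_tokens / padded_tokens      # fraction of wasted work
print(f"{waste:.0%} of the batch is padding")  # prints "59% of the batch is padding"
```

With this spread of lengths, well over half the batch is padding, which is exactly the inefficiency that ragged execution tries to eliminate.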


Section 03

Core Technologies: Three Innovative Breakthroughs of RPA

RPA addresses these challenges through three key technologies:

  1. Fine-grained chunking and dynamic slicing: divide the KV cache into fixed-size pages, allocate them on demand, slice dynamically, and reuse memory to reduce fragmentation;
  2. Software pipeline fusion: deeply fuse KV-cache updates with attention computation, keeping intermediate results in SRAM to hide latency and improve throughput;
  3. Distribution-aware compilation: generate dedicated kernels (for decode, prefill, and mixed workloads) based on the load distribution, adaptively optimizing performance.
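The first technique can be sketched in plain Python. This is a minimal illustration of on-demand page allocation and reuse, with hypothetical names and a made-up page size; it is not the RPA kernel's actual data structure:

```python
class PagedKVCache:
    """Sketch of a paged KV cache: fixed-size pages, allocated on demand,
    returned to a free list for reuse. Names and sizes are hypothetical."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # indices of unused pages
        self.tables = {}                    # request id -> (page list, token count)

    def append(self, req_id, n_tokens):
        """Grow a request's KV cache by n_tokens, allocating only the
        extra pages actually needed (ceil division)."""
        pages, length = self.tables.get(req_id, ([], 0))
        needed = -(-(length + n_tokens) // self.page_size) - len(pages)
        if needed > len(self.free):
            raise MemoryError("out of KV cache pages")
        pages = pages + [self.free.pop() for _ in range(needed)]
        self.tables[req_id] = (pages, length + n_tokens)

    def release(self, req_id):
        """Finished request: return its pages to the free list."""
        pages, _ = self.tables.pop(req_id)
        self.free.extend(pages)
```

Because pages are fixed-size and non-contiguous, a finished request's pages can immediately back a new request of any length, which is how paging avoids fragmentation.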

Section 04

Performance Evidence: Utilization Close to Hardware Limits

Evaluated on TPU v7x with the Llama 3 8B model:

  • Memory bandwidth utilization reaches 86% in the decode phase, eliminating the memory bottleneck and far exceeding the traditional 50-60%;
  • Model FLOPs utilization reaches 73% in the prefill phase, a top-tier result that fully exploits the TPU's compute potential;
  • RPA is integrated into vLLM and SGLang as a TPU backend, so developers get the performance improvements without modifying code.

Section 05

Technical Insights: Key Logic for TPU Architecture Adaptation

RPA optimizes for the architectural differences between TPUs and GPUs:

  1. Memory hierarchy: TPUs have larger HBM; fine-grained chunking maximizes local data reuse;
  2. Matrix compute units: TPU MXUs favor large, regular matrix operations; RPA aggregates small operations through batching and fusion;
  3. Compilation ecosystem: Pallas and Mosaic provide flexible abstractions that support complex kernel optimization.
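Point 2, feeding the MXU large operations, can be illustrated with NumPy: many tiny matmuls issued one at a time versus a single batched operation that presents the hardware with one large, regular workload. The shapes here are arbitrary:

```python
import numpy as np

# Many small matmuls vs. one batched op: same math, but the batched
# form is the kind of aggregated workload a wide systolic array wants.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 8, 8))  # 32 tiny 8x8 matmuls
b = rng.standard_normal((32, 8, 8))

one_at_a_time = np.stack([a[i] @ b[i] for i in range(32)])
batched = a @ b  # single batched matmul over the leading axis

assert np.allclose(one_at_a_time, batched)  # identical results
```

On an accelerator, the fused/batched form amortizes launch and pipeline costs that would otherwise dominate when each small operation is dispatched separately.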

Section 06

Conclusion and Outlook: Maturity and Future of the TPU Inference Ecosystem

RPA marks an improvement in the maturity of TPU inference:

  • Cost-effectiveness: Higher hardware utilization reduces inference costs;
  • Ecosystem improvement: Integration with mainstream frameworks lowers the barrier to TPU adoption;
  • Technical demonstration: provides a reference for other accelerators (e.g., AWS Trainium, Graphcore IPU).

In the future, as multimodal and agentic AI workloads grow more complex, RPA's fine-grained management and adaptive compilation may become the standard for next-generation inference systems.