# Ragged Paged Attention: A High-Performance LLM Inference Kernel Built for TPUs

> The Google Research team has introduced the Ragged Paged Attention (RPA) kernel, which achieves 86% memory bandwidth utilization and 73% model FLOPs utilization on TPUs through three key technologies: fine-grained chunking, software pipeline fusion, and distribution-aware compilation, providing a production-grade solution for LLM inference.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-16T18:30:13.000Z
- Last activity: 2026-04-20T01:49:40.984Z
- Popularity: 70.0
- Keywords: TPU, LLM inference, attention mechanism, kernel optimization, vLLM, SGLang, PagedAttention, large-model deployment
- Page link: https://www.zingnex.cn/en/forum/thread/ragged-paged-attention-tpullm
- Canonical: https://www.zingnex.cn/forum/thread/ragged-paged-attention-tpullm
- Markdown source: floors_fallback

---

## [Introduction] Ragged Paged Attention: A High-Performance LLM Inference Kernel Tailored for TPUs

The Google Research team has introduced the Ragged Paged Attention (RPA) kernel, specifically designed for TPUs. Through three key technologies—fine-grained chunking, software pipeline fusion, and distribution-aware compilation—it achieves 86% memory bandwidth utilization and 73% model FLOPs utilization. It has been integrated into the vLLM and SGLang frameworks, providing a production-grade solution for LLM inference and enhancing the cost-effectiveness and ecosystem maturity of TPUs in inference scenarios.

## Background: Opportunities, Challenges of TPU Inference and the Dilemma of Ragged Execution

TPUs have become an attractive choice for enterprise LLM deployment thanks to their energy efficiency and total cost of ownership (TCO). However, most existing inference stacks are designed for GPUs, and efficient TPU-native solutions are scarce. Modern LLM serving must handle requests of widely varying lengths (the "ragged" execution pattern), which raises three challenges:
1. Memory fragmentation: the KV cache is hard to manage when sequence lengths differ widely;
2. Unbalanced computational load: padding every request to the longest length wastes compute;
3. Complex scheduling: the prefill and decode phases compete for the same resources.
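To make the padding problem concrete, here is a minimal sketch (not RPA code) comparing how many tokens get processed under padded batching versus ragged packing; the request lengths are invented for illustration.

```python
# Illustrative sketch: why padded batching wastes work when request
# lengths are ragged. All numbers are made up for the example.

def padded_tokens(lengths):
    """Tokens processed when every request is padded to the longest one."""
    return len(lengths) * max(lengths)

def ragged_tokens(lengths):
    """Tokens processed when requests are packed without padding."""
    return sum(lengths)

lengths = [37, 512, 91, 1024, 5]      # mixed short/long request lengths
padded = padded_tokens(lengths)       # 5 * 1024 = 5120 token slots
useful = ragged_tokens(lengths)       # 1669 tokens of real work
print(f"useful fraction: {useful / padded:.2%}")
```

With this mix, under a third of the padded batch is useful work, which is the waste ragged execution is meant to eliminate.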

## Core Technologies: Three Innovative Breakthroughs of RPA

RPA addresses these challenges through three key technologies:
1. Fine-grained chunking and dynamic slicing: Divide KV cache into fixed pages, allocate on demand, dynamically slice, and reuse memory to reduce fragmentation;
2. Software pipeline fusion: Deeply fuse KV updates with attention computation, keep intermediate results in SRAM to hide latency and improve throughput;
3. Distribution-aware compilation: Generate dedicated kernels (for decode, prefill, and mixed loads) based on load types to adaptively optimize performance.
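The first technique follows the PagedAttention idea: the KV cache becomes a pool of fixed-size physical pages, and each request keeps a page table from logical token positions to pages. The sketch below is an illustrative bookkeeping model under assumed parameters (e.g. `PAGE_SIZE = 16`), not the RPA implementation.

```python
# Minimal paged KV-cache bookkeeping sketch (illustrative, not RPA code):
# pages are allocated on demand and returned to the pool on completion,
# which is what keeps fragmentation low.

PAGE_SIZE = 16  # tokens per page (assumed value)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # physical page pool
        self.page_tables = {}                     # request id -> [page ids]
        self.lengths = {}                         # request id -> token count

    def append_token(self, req_id):
        """Reserve a slot for one more token, allocating a page on demand."""
        table = self.page_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n % PAGE_SIZE == 0:                    # current page full (or none yet)
            table.append(self.free_pages.pop())   # grab a free physical page
        self.lengths[req_id] = n + 1
        return table[n // PAGE_SIZE], n % PAGE_SIZE  # physical (page, slot)

    def release(self, req_id):
        """Return a finished request's pages to the pool for reuse."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(20):                 # 20 tokens span two 16-token pages
    cache.append_token("req-A")
print(cache.page_tables["req-A"])   # the request's physical pages
cache.release("req-A")              # pages immediately reusable
```

Because pages are fixed-size and recycled, no per-request contiguous region is ever reserved, so mixing long and short sequences does not fragment memory.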

## Performance Evidence: Utilization Close to Hardware Limits

Evaluated on TPU v7x with the Llama 3 8B model:
- In the decode phase, memory bandwidth utilization reaches 86%, far above the 50-60% typical of earlier kernels, largely eliminating the memory bottleneck;
- In the prefill phase, model FLOPs utilization reaches 73%, a top-tier figure that exploits most of the TPU's compute;
- RPA is integrated into vLLM and SGLang as a TPU backend, so developers get the performance improvements without modifying code.
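Why bandwidth utilization dominates decode can be seen from a simple roofline estimate: each generated token must stream roughly the model weights from HBM, so throughput scales with achieved bandwidth. The sketch below uses the reported utilization figures but an assumed peak bandwidth and byte count, purely to show the arithmetic; these are not TPU v7x specs.

```python
# Back-of-the-envelope roofline for the decode phase. Hardware numbers
# here are illustrative assumptions, not published TPU v7x figures.
# Decode is memory-bound, so tokens/s ~= achieved bandwidth / bytes per token.

def decode_tokens_per_sec(peak_bw_gbs, bw_utilization, bytes_per_token_gb):
    achieved_bw = peak_bw_gbs * bw_utilization
    return achieved_bw / bytes_per_token_gb

# Assumed: ~8B params in bf16 => ~16 GB of weights streamed per token
# (ignoring KV-cache reads), and a hypothetical 1200 GB/s peak HBM bandwidth.
baseline = decode_tokens_per_sec(1200, 0.55, 16.0)  # ~55% "traditional" util
rpa      = decode_tokens_per_sec(1200, 0.86, 16.0)  # RPA's reported 86%
print(f"baseline ~{baseline:.0f} tok/s, RPA ~{rpa:.0f} tok/s "
      f"({rpa / baseline:.2f}x)")
```

Whatever the absolute numbers, the ratio holds: moving from ~55% to 86% bandwidth utilization is a ~1.56x decode speedup on the same hardware.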

## Technical Insights: Key Logic for TPU Architecture Adaptation

RPA is optimized for the architectural differences between TPUs and GPUs:
1. Memory hierarchy: TPUs pair large HBM with limited on-chip SRAM, so fine-grained chunking maximizes on-chip data reuse;
2. Matrix computation units: TPU MXUs favor large, regular operations, so RPA aggregates small operations through batching and fusion;
3. Compilation ecosystem: Pallas and Mosaic provide flexible abstractions that support complex kernel optimization.
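The "aggregate small operations" idea in point 2 can be sketched numerically: many tiny matmuls underutilize a wide matrix unit, so the operands are stacked and issued as one large matmul. Plain NumPy stands in for the TPU kernel here; shapes and counts are invented for illustration.

```python
# Illustrative batching/fusion sketch: 32 small per-request projections
# fused into one large matmul, the shape MXU-style units prefer.
import numpy as np

rng = np.random.default_rng(0)
lhs = [rng.standard_normal((8, 64)) for _ in range(32)]  # 32 small inputs
rhs = rng.standard_normal((64, 64))   # shared weight, as in attention projections

# Naive: 32 separate small matmuls, each too small to fill the matrix unit
naive = [a @ rhs for a in lhs]

# Aggregated: one (256, 64) @ (64, 64) matmul, a single large launch
stacked = np.concatenate(lhs, axis=0)
fused_split = np.split(stacked @ rhs, 32, axis=0)

# Same results, one kernel invocation instead of 32
assert all(np.allclose(a, b) for a, b in zip(naive, fused_split))
```

The fusion is exact because the requests share the weight matrix; the gain is purely in how the work is presented to the hardware.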

## Conclusion and Outlook: Maturity and Future of the TPU Inference Ecosystem

RPA marks a step forward in the maturity of TPU inference:
- Cost-effectiveness: higher hardware utilization lowers inference cost;
- Ecosystem improvement: integration with mainstream frameworks lowers the barrier to TPU adoption;
- Technical demonstration: it provides a reference design for other accelerators (e.g., AWS Trainium, Graphcore IPU).

Looking ahead, as multimodal and agentic AI workloads grow more complex, RPA's fine-grained memory management and adaptive compilation may become standard features of next-generation inference systems.
