Section 01
[Introduction] Ragged Paged Attention: A High-Performance LLM Inference Kernel Tailored for TPUs
The Google Research team has introduced Ragged Paged Attention (RPA), an attention kernel designed specifically for TPUs. Through three key techniques (fine-grained chunking, software pipeline fusion, and distribution-aware compilation), it achieves 86% memory-bandwidth utilization and 73% model FLOPs utilization. The kernel has been integrated into the vLLM and SGLang frameworks, providing a production-grade path for LLM inference and strengthening the cost-effectiveness and ecosystem maturity of TPUs in inference workloads.
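To make the "ragged paged" idea concrete, here is a minimal, illustrative sketch of single-head paged attention over a ragged batch: each sequence has a different KV length, and its K/V entries live in fixed-size pages drawn from a shared pool via a page table. This is not the RPA kernel itself (which is a fused TPU kernel); all function and variable names below are hypothetical, and the reference-style loop only shows the data layout the technique operates on.

```python
import numpy as np

def paged_attention(q, kv_pages, page_table, seq_lens, page_size):
    """Illustrative single-head paged attention over a ragged batch.

    q:          (num_seqs, head_dim)  one query vector per sequence
    kv_pages:   (num_pages, page_size, 2, head_dim)  shared K/V page pool
    page_table: (num_seqs, max_pages) page indices per sequence (-1 = unused)
    seq_lens:   (num_seqs,)           true KV length of each sequence (ragged)
    """
    outs = []
    for i, L in enumerate(seq_lens):
        # Gather this sequence's pages from the shared pool, then trim
        # the flattened K/V to its true (ragged) length.
        n_pages = -(-L // page_size)                # ceil(L / page_size)
        pages = kv_pages[page_table[i, :n_pages]]   # (n_pages, page_size, 2, d)
        k = pages[:, :, 0].reshape(-1, q.shape[-1])[:L]
        v = pages[:, :, 1].reshape(-1, q.shape[-1])[:L]
        # Standard scaled-dot-product attention with a stable softmax.
        scores = q[i] @ k.T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        outs.append(w @ v)
    return np.stack(outs)
```

A real kernel would fuse the gather, softmax, and weighted sum and process pages in chunks to keep memory traffic on-chip; the point here is only the ragged page-table indirection that lets sequences of very different lengths share one KV pool without padding.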