Zing Forum

Ragged Paged Attention: A High-Performance LLM Inference Kernel Built for TPUs

The Google Research team has released the Ragged Paged Attention (RPA) kernel, which achieves 86% memory bandwidth utilization and 73% model FLOPs utilization on TPUs through three key techniques: fine-grained chunking, software pipeline fusion, and distribution-aware compilation, providing a production-grade solution for LLM inference.

Tags: TPU · LLM inference · attention mechanisms · kernel optimization · vLLM · SGLang · PagedAttention · LLM deployment
Published 2026-04-17 02:30 · Recent activity 2026-04-20 09:49 · Estimated read: 5 min

Section 01

[Introduction] Ragged Paged Attention: A High-Performance LLM Inference Kernel Tailored for TPUs

The Google Research team has introduced the Ragged Paged Attention (RPA) kernel, specifically designed for TPUs. Through three key technologies—fine-grained chunking, software pipeline fusion, and distribution-aware compilation—it achieves 86% memory bandwidth utilization and 73% model FLOPs utilization. It has been integrated into the vLLM and SGLang frameworks, providing a production-grade solution for LLM inference and enhancing the cost-effectiveness and ecosystem maturity of TPUs in inference scenarios.


Section 02

Background: Opportunities and Challenges of TPU Inference, and the Dilemma of Ragged Execution

TPUs have become a preferred choice for enterprise LLM deployment thanks to their energy efficiency and total cost of ownership (TCO). However, most existing inference solutions are designed for GPUs, and efficient TPU-native solutions are scarce. Modern LLM serving must handle requests of widely varying lengths (ragged execution), which raises three major challenges:

  1. Memory fragmentation: the KV cache is difficult to manage;
  2. Imbalanced compute load: padding causes wasted computation;
  3. Complex scheduling: resources must be balanced between the prefill and decode phases.
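The padding waste in point 2 is easy to quantify: in a padded batch, every request is processed at the length of the longest one. A toy Python example (the request lengths are hypothetical, chosen only to illustrate the effect):

```python
# Illustrative only: how much compute padding wastes in a padded batch.
seq_lens = [12, 87, 3, 160]                    # ragged request lengths
padded_tokens = max(seq_lens) * len(seq_lens)  # 160 * 4 = 640 slots
useful_tokens = sum(seq_lens)                  # 262 real tokens
waste = 1 - useful_tokens / padded_tokens      # fraction of wasted work
print(f"{waste:.0%} of the batch is padding")  # prints "59% of the batch is padding"
```

With this spread of lengths, well over half the batch is padding, which is exactly the inefficiency that ragged execution tries to eliminate.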


Section 03

Core Technologies: Three Innovative Breakthroughs of RPA

RPA addresses these challenges through three key technologies:

  1. Fine-grained chunking and dynamic slicing: divide the KV cache into fixed-size pages, allocate them on demand, slice dynamically, and reuse memory to reduce fragmentation;
  2. Software pipeline fusion: deeply fuse KV-cache updates with attention computation, keeping intermediate results in SRAM to hide latency and improve throughput;
  3. Distribution-aware compilation: generate dedicated kernels (for decode, prefill, and mixed workloads) based on the load distribution, adaptively optimizing performance.
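The first technique can be sketched in plain Python. This is a minimal illustration of on-demand page allocation and reuse, with hypothetical names and a made-up page size; it is not the RPA kernel's actual data structure:

```python
class PagedKVCache:
    """Sketch of a paged KV cache: fixed-size pages, allocated on demand,
    returned to a free list for reuse. Names and sizes are hypothetical."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # indices of unused pages
        self.tables = {}                    # request id -> (page list, token count)

    def append(self, req_id, n_tokens):
        """Grow a request's KV cache by n_tokens, allocating only the
        extra pages actually needed (ceil division)."""
        pages, length = self.tables.get(req_id, ([], 0))
        needed = -(-(length + n_tokens) // self.page_size) - len(pages)
        if needed > len(self.free):
            raise MemoryError("out of KV cache pages")
        pages = pages + [self.free.pop() for _ in range(needed)]
        self.tables[req_id] = (pages, length + n_tokens)

    def release(self, req_id):
        """Finished request: return its pages to the free list."""
        pages, _ = self.tables.pop(req_id)
        self.free.extend(pages)
```

Because pages are fixed-size and non-contiguous, a finished request's pages can immediately back a new request of any length, which is how paging avoids fragmentation.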

Section 04

Performance Evidence: Utilization Close to Hardware Limits

Evaluated on TPU v7x with the Llama 3 8B model:

  • Memory bandwidth utilization reaches 86% in the decode phase, eliminating the memory bottleneck and far exceeding the traditional 50-60%;
  • Model FLOPs utilization reaches 73% in the prefill phase, a top-tier result that fully exploits the TPU's compute potential;
  • RPA is integrated into vLLM and SGLang as a TPU backend, so developers get the performance improvements without modifying code.

Section 05

Technical Insights: Key Logic for TPU Architecture Adaptation

RPA optimizes for the architectural differences between TPUs and GPUs:

  1. Memory hierarchy: TPUs have larger HBM; fine-grained chunking maximizes local data reuse;
  2. Matrix compute units: TPU MXUs favor large, regular matrix operations; RPA aggregates small operations through batching and fusion;
  3. Compilation ecosystem: Pallas and Mosaic provide flexible abstractions that support complex kernel optimization.
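Point 2, feeding the MXU large operations, can be illustrated with NumPy: many tiny matmuls issued one at a time versus a single batched operation that presents the hardware with one large, regular workload. The shapes here are arbitrary:

```python
import numpy as np

# Many small matmuls vs. one batched op: same math, but the batched
# form is the kind of aggregated workload a wide systolic array wants.
rng = np.random.default_rng(0)
a = rng.standard_normal((32, 8, 8))  # 32 tiny 8x8 matmuls
b = rng.standard_normal((32, 8, 8))

one_at_a_time = np.stack([a[i] @ b[i] for i in range(32)])
batched = a @ b  # single batched matmul over the leading axis

assert np.allclose(one_at_a_time, batched)  # identical results
```

On an accelerator, the fused/batched form amortizes launch and pipeline costs that would otherwise dominate when each small operation is dispatched separately.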

Section 06

Conclusion and Outlook: Maturity and Future of the TPU Inference Ecosystem

RPA marks an improvement in the maturity of TPU inference:

  • Cost-effectiveness: Higher hardware utilization reduces inference costs;
  • Ecosystem improvement: Integration with mainstream frameworks lowers the barrier to TPU adoption;
  • Technical demonstration: provides a reference for other accelerators (e.g., AWS Trainium, Graphcore IPU).

In the future, as multimodal and agentic AI workloads grow more complex, RPA's fine-grained management and adaptive compilation may become the standard for next-generation inference systems.