Zing Forum

GPUCache: A PB-scale Ultra-low Latency Distributed GPU Cache System to Eliminate Redundant Computation Overhead in Large Model Inference

This article introduces GPUCache, an open-source PB-scale distributed GPU cache system. Using Rust, NVIDIA DOCA, RDMA, and BF-4 DPU technologies, it builds a high-speed bridge between GPU HBM and NVMe storage, significantly reducing redundant computation costs in large language model (LLM) inference.

Tags: GPU cache, large language models, AI inference, Rust, NVIDIA DOCA, RDMA, DPU, distributed systems, low latency, PB-scale storage
Published 2026-05-26 19:07 | Recent activity 2026-05-26 19:31 | Estimated read 6 min

Section 01

GPUCache Project Overview: A PB-scale Ultra-low Latency Distributed GPU Cache System

GPUCache is an open-source, PB-scale, ultra-low-latency distributed GPU cache system developed by the rustfs team and designed specifically for AI inference. The project was released on May 26, 2026, and its source code is hosted on GitHub at https://github.com/rustfs/GPUCache.

Built with Rust, the NVIDIA DOCA framework, RDMA networking, and BF-4 DPUs, the system forms a high-speed bridge between GPU HBM and NVMe storage. It aims to ease the memory bottleneck in LLM inference, eliminate redundant computation, and balance cost against performance.

Section 02

Background and Challenges: Memory Bottleneck Dilemma in LLM Inference

As LLMs grow in scale, inference runs into severe memory bottlenecks:

  • Model parameters number in the billions or even trillions, and inference requires frequent access to massive KV caches;
  • Traditional solutions fall short: keeping the entire cache in HBM is expensive and capacity-limited, while offloading to CPU memory or NVMe adds latency that hurts performance;
  • Core requirement: how can cache capacity be expanded to PB scale while latency stays ultra-low?
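
To make the scale of the problem concrete, the sketch below estimates KV cache size from a model's shape, using the usual accounting of one K and one V vector per layer, per KV head, per token. The `ModelShape` struct and the numbers in the usage note are illustrative assumptions, not figures from the GPUCache project.

```rust
/// Hypothetical description of a transformer's attention shape.
/// Field names are chosen for this example only.
#[derive(Debug, Clone, Copy)]
pub struct ModelShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    /// Bytes per stored element, e.g. 2 for FP16/BF16.
    pub bytes_per_elem: u64,
}

impl ModelShape {
    /// KV cache bytes needed for one token:
    /// 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem.
    pub fn kv_bytes_per_token(&self) -> u64 {
        2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem
    }

    /// KV cache bytes needed for a context of `tokens` tokens.
    pub fn kv_bytes_for_context(&self, tokens: u64) -> u64 {
        self.kv_bytes_per_token() * tokens
    }
}
```

With an assumed shape of 80 layers, 8 KV heads, head dimension 128, and 2-byte elements, each token costs 320 KiB of KV cache, and a single 131,072-token context needs 40 GiB. That is comparable to the HBM capacity of one accelerator, before model weights are counted.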

Section 03

Core Technical Architecture: High-performance Design with Hardware-Software Coordination

Rust Language Foundation

Rust provides zero-cost abstractions and memory safety, and because it has no garbage collector it avoids GC-induced latency jitter, which makes it a good fit for latency-sensitive systems.

NVIDIA DOCA and BF-4 DPU

The DOCA framework is used to offload operations such as cache management and data compression/encryption to the BF-4 DPU, freeing host CPU resources and reducing processing latency.

RDMA Network Transmission

RDMA provides high-speed data transfer between distributed nodes, so that remote cache access latency approaches that of local memory. Nodes are interconnected via RDMA network cards running at 100 Gbps or faster, and data is transferred directly from remote NVMe to local GPU memory.
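
As a rough feel for what a 100 Gbps link means for cache blocks, the sketch below computes a lower bound on wire time for a transfer of a given size. It is plain arithmetic under stated assumptions: it ignores NVMe read latency, protocol overhead, and congestion, and the function name `wire_time_us` is hypothetical rather than part of GPUCache.

```rust
/// Lower bound on wire time, in microseconds, for moving `bytes` over a link
/// of `link_gbps` gigabits per second (decimal: 1 Gbps = 1e9 bits/s).
/// This counts serialization time only; real transfers take longer.
pub fn wire_time_us(bytes: u64, link_gbps: f64) -> f64 {
    let bits = bytes as f64 * 8.0;
    // 1 Gbps = 1e9 bits per second = 1e3 bits per microsecond.
    let bits_per_us = link_gbps * 1_000.0;
    bits / bits_per_us
}
```

Under these assumptions a 1 MiB block needs about 84 µs of wire time on a 100 Gbps link, and a 1 GiB transfer about 86 ms, so block size and link speed set a floor on how quickly cached state can be restored.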

Section 04

Key Problem Solutions: Eliminating Redundant Computation Tax and Cost Optimization

  1. Eliminating Redundant Computation Tax: The PB-scale cache retains KV values from long conversation histories so they do not have to be recomputed, reducing long-context inference latency from seconds to milliseconds;
  2. Cost and Performance Balance: Low-cost NVMe SSDs serve as the cache backend and, combined with hot data identification and prefetching algorithms, deliver performance close to HBM while reducing cost per TB by an order of magnitude;
  3. Distributed Expansion: Capacity grows linearly as nodes are added, supporting PB-scale storage to meet the needs of ultra-long documents and large-scale conversations.
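
The hot data idea in item 2 can be sketched as a two-tier cache with least-recently-used demotion. This is a minimal in-memory illustration, not GPUCache's actual algorithm: the `TieredCache` type is hypothetical, the "hot" map stands in for HBM, the "cold" map stands in for NVMe, and plain LRU is an assumed policy.

```rust
use std::collections::{HashMap, VecDeque};

/// Which tier served a lookup.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Tier {
    Hot,
    Cold,
}

/// Hypothetical two-tier cache: a small hot tier backed by an unbounded cold tier.
pub struct TieredCache {
    hot_capacity: usize,
    hot: HashMap<u64, Vec<u8>>,
    lru: VecDeque<u64>, // front = least recently used hot key
    cold: HashMap<u64, Vec<u8>>,
}

impl TieredCache {
    pub fn new(hot_capacity: usize) -> Self {
        assert!(hot_capacity > 0, "hot tier needs at least one slot");
        Self {
            hot_capacity,
            hot: HashMap::new(),
            lru: VecDeque::new(),
            cold: HashMap::new(),
        }
    }

    /// Insert or overwrite a block; new blocks always start in the hot tier.
    pub fn put(&mut self, key: u64, block: Vec<u8>) {
        self.cold.remove(&key);
        self.hot.insert(key, block);
        self.touch(key);
        self.evict_if_needed();
    }

    /// Look up a block and report which tier held it. Cold hits are promoted.
    pub fn get(&mut self, key: u64) -> Option<(Tier, &[u8])> {
        let tier = if self.hot.contains_key(&key) {
            Tier::Hot
        } else if let Some(block) = self.cold.remove(&key) {
            self.hot.insert(key, block);
            Tier::Cold
        } else {
            return None;
        };
        self.touch(key);
        self.evict_if_needed();
        self.hot.get(&key).map(|b| (tier, b.as_slice()))
    }

    /// Mark `key` as the most recently used hot entry.
    fn touch(&mut self, key: u64) {
        self.lru.retain(|k| *k != key);
        self.lru.push_back(key);
    }

    /// Demote least recently used blocks to the cold tier until the hot tier fits.
    fn evict_if_needed(&mut self) {
        while self.hot.len() > self.hot_capacity {
            match self.lru.pop_front() {
                Some(victim) => {
                    if let Some(block) = self.hot.remove(&victim) {
                        self.cold.insert(victim, block);
                    }
                }
                None => break,
            }
        }
    }
}
```

A real system would track hotness with cheaper bookkeeping than a linear `retain` and would prefetch blocks before they are requested; the sketch only shows the promote and demote flow between the two tiers.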

Section 05

Application Scenarios and Value: Adapting to Multiple AI Workloads

GPUCache is suitable for the following scenarios:

  • Long-context LLM services: Maintaining stable response speed for ultra-long document processing;
  • Multi-tenant dialogue systems: Caching user conversation history to quickly restore states;
  • Batch inference optimization: Caching common prefix computation results to reduce redundant calculations;
  • Hybrid deployment: Helping model fine-tuning and inference services share resources to improve utilization.
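
The common-prefix scenario can be illustrated with chained block hashing, a technique used by several inference engines for prefix caching; whether GPUCache keys its cache this way is an assumption. In the sketch, each fixed-size block of token IDs gets a key derived from its own tokens and its parent block's key, so two requests that share a prefix produce the same leading keys. The function names and the use of `DefaultHasher` are choices made for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute one cache key per full block of `block_size` tokens.
/// Each key depends on the block's tokens and on the previous block's key,
/// so a key identifies the whole prefix up to and including that block.
/// A trailing partial block is ignored.
pub fn prefix_block_keys(tokens: &[u32], block_size: usize) -> Vec<u64> {
    assert!(block_size > 0, "block_size must be positive");
    let mut keys = Vec::new();
    let mut parent: u64 = 0;
    for block in tokens.chunks_exact(block_size) {
        // DefaultHasher is not guaranteed stable across Rust releases;
        // persistent or cross-node keys would need a fixed hash function.
        let mut hasher = DefaultHasher::new();
        parent.hash(&mut hasher);
        block.hash(&mut hasher);
        parent = hasher.finish();
        keys.push(parent);
    }
    keys
}

/// Number of leading blocks two requests have in common,
/// i.e. how many cached KV blocks the second request could reuse.
pub fn shared_prefix_blocks(a: &[u64], b: &[u64]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

Because keys are chained, a block with identical tokens but a different preceding context gets a different key, which prevents reusing KV values that were computed under a different prefix.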

Section 06

Technical Significance and Outlook: Evolution Direction of AI Infrastructure

GPUCache shows how Rust, DPU offload, and RDMA can be combined to build a storage system that goes beyond traditional architectures and is not constrained by the limits of any single piece of hardware.

As LLM scale grows, such dedicated cache systems will become more important in AI infrastructure, providing a scalable path for even larger models in the future.

Recommendation: Infrastructure teams running large-scale LLM services may find this open-source project worth an in-depth evaluation.