RDMA KV Cache: A Separated LLM Inference Acceleration Scheme Based on GPUDirect

The rdma-kv-cache project implements a separated LLM inference architecture, using GPUDirect RDMA technology to achieve zero-copy KV Cache transfer between GPUs, significantly reducing large model inference latency.

Tags: LLM Inference · RDMA · GPUDirect · vLLM · Separated Architecture · GPU Optimization · High-Performance Computing
Published 2026-04-27 06:40 · Recent activity 2026-04-27 07:20 · Estimated read 5 min

Section 01

RDMA KV Cache: A Separated LLM Inference Acceleration Scheme Using GPUDirect

The rdma-kv-cache project implements a separated LLM inference architecture, using GPUDirect RDMA technology to achieve zero-copy KV Cache transfer between GPUs and significantly reduce large-model inference latency.


Section 02

Background: Challenges of Traditional LLM Inference & Separated Architecture

As large language models continue to grow in scale, the traditional monolithic inference architecture suffers from resource imbalance: the prefill stage is compute-intensive but short, while the decode stage is memory-bound with low compute density. A separated (disaggregated) inference architecture decouples the two stages onto dedicated GPU clusters and transfers the KV Cache over a high-speed interconnect, so each cluster's resources are utilized more efficiently.


Section 03

Core Architecture & Key Components

The system consists of four core components:

  • Orchestrator: The request entry point; routes requests to prefill nodes and manages node registration and load balancing.
  • Prefill Nodes: Run a customized vLLM, process input prompts to generate the initial KV Cache, and send it to decode nodes via RDMA.
  • Decode Nodes: Run a customized vLLM, receive the KV Cache via RDMA, and perform autoregressive token generation.
  • GPUDirect RDMA: The technical foundation; the Mellanox/NVIDIA nvidia-peermem kernel module lets the NIC access GPU memory directly, bypassing the CPU and reaching microsecond-level transfer latency (a minimal registration sketch follows this list).
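To make the GPUDirect RDMA path concrete, here is a minimal, self-contained sketch (not code from the project) showing how a CUDA device buffer can be registered with an InfiniBand NIC once nvidia-peermem is loaded; device selection, queue-pair setup, and error handling are reduced to the bare minimum, and the buffer size and access flags are illustrative assumptions.

```cpp
// Minimal GPUDirect RDMA registration sketch (illustrative, not project code).
// With the nvidia-peermem kernel module loaded, ibv_reg_mr() accepts a CUDA
// device pointer, so the NIC can DMA to/from GPU memory without host staging.
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Allocate a KV-cache-sized buffer directly in GPU memory (size is arbitrary here).
    const size_t kv_bytes = 64ull << 20;  // 64 MiB
    void* gpu_buf = nullptr;
    if (cudaMalloc(&gpu_buf, kv_bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Open the first RDMA device and create a protection domain.
    int num_devices = 0;
    ibv_device** devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) { std::fprintf(stderr, "no RDMA device found\n"); return 1; }
    ibv_context* ctx = ibv_open_device(devs[0]);
    ibv_pd* pd = ibv_alloc_pd(ctx);

    // Register the GPU buffer with the NIC: no staging copy into host memory is involved.
    ibv_mr* mr = ibv_reg_mr(pd, gpu_buf, kv_bytes,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_WRITE |
                            IBV_ACCESS_REMOTE_READ);
    if (!mr) { std::fprintf(stderr, "ibv_reg_mr on GPU memory failed\n"); return 1; }
    std::printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    // Cleanup (a real engine would keep the registration alive for the server's lifetime).
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```

If registering GPU memory fails here while registering ordinary host memory succeeds, the usual culprit is a missing or unloaded nvidia-peermem module, which is presumably the kind of condition the project's verify_gpudirect.sh script checks for.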

Section 04

Technical Implementation Details

  • vLLM Integration: An RDMAConnector module bridges vLLM to a C++ RDMA engine (exposed via pybind11 bindings), retaining vLLM's scheduling optimizations (see the binding sketch after this list).
  • Zero-Copy Transfer: Eliminates the intermediate copies of the conventional path (GPU → host memory → network → host memory → GPU) by registering vLLM's memory regions directly and writing them into the decode nodes' GPU buffers.
  • Deployment Tools: Includes an installation script (install_all_nodes.sh), a deployment script (smart_deploy.sh), run/stop scripts (smart_run.sh, smart_stop.sh), and a verification script (verify_gpudirect.sh).
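As an illustration of how a C++ engine might be surfaced to Python, the hedged sketch below exposes a native class through pybind11. The module name, class name, and method signatures are assumptions made for the example; the project's actual RDMAConnector and engine API may look different.

```cpp
// Illustrative pybind11 binding for a native RDMA engine (not the project's real API).
// The RDMA calls themselves are stubbed out; only the Python-facing surface is shown.
#include <pybind11/pybind11.h>
#include <cstddef>
#include <cstdint>
#include <string>

namespace py = pybind11;

// Hypothetical thin wrapper around the native RDMA send path.
class RdmaEngine {
public:
    RdmaEngine(const std::string& device, int port) : device_(device), port_(port) {}

    // Register an existing GPU allocation (e.g. a KV Cache block pool) by raw
    // device pointer and size; returns an opaque region id.
    int register_region(std::uintptr_t gpu_ptr, std::size_t bytes) {
        // A real implementation would call ibv_reg_mr() on the GPU pointer here.
        (void)gpu_ptr; (void)bytes;
        return next_region_id_++;
    }

    // Post an RDMA write of one KV block to the remote decode node's buffer.
    void send_kv_block(int region_id, std::size_t offset, std::size_t bytes) {
        // A real implementation would build an IBV_WR_RDMA_WRITE work request here.
        (void)region_id; (void)offset; (void)bytes;
    }

private:
    std::string device_;
    int port_;
    int next_region_id_ = 0;
};

PYBIND11_MODULE(rdma_engine, m) {
    m.doc() = "Illustrative RDMA engine bindings";
    py::class_<RdmaEngine>(m, "RdmaEngine")
        .def(py::init<const std::string&, int>(), py::arg("device"), py::arg("port"))
        .def("register_region", &RdmaEngine::register_region,
             py::arg("gpu_ptr"), py::arg("bytes"))
        .def("send_kv_block", &RdmaEngine::send_kv_block,
             py::arg("region_id"), py::arg("offset"), py::arg("bytes"));
}
```

On the Python side, a connector could then hand vLLM's KV Cache pointers straight to register_region and stream blocks with send_kv_block, which is the general shape of integration the RDMAConnector description suggests.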

Section 05

Performance Metrics & Optimization Directions

  • Performance Indicators: The prefill stage takes 50–200 ms depending on prompt length, while RDMA transfer latency is at the microsecond level; benchmark tools such as benchmark_client are provided.
  • Optimization: Key parameters (KV buffer size, RDMA block size, queue depth) are adjusted via a YAML config, while model settings (GPU memory utilization, max sequence length) can be tuned through environment variables (an illustrative sketch of these tunables follows this list).
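For a sense of what these tunables look like in code, the small sketch below collects them into a single struct with environment-variable overrides. The parameter names, defaults, and variable names are assumptions for illustration only, not the project's actual configuration keys.

```cpp
// Illustrative configuration sketch (names and defaults are assumptions, not the
// project's real YAML keys or environment variables).
#include <cstdio>
#include <cstdlib>
#include <string>

struct RdmaKvConfig {
    size_t kv_buffer_bytes  = 256ull << 20;  // total RDMA-registered KV buffer
    size_t rdma_block_bytes = 1ull << 20;    // granularity of each RDMA write
    int    queue_depth      = 128;           // outstanding RDMA work requests
    double gpu_mem_util     = 0.90;          // fraction of GPU memory vLLM may use
    int    max_seq_len      = 4096;          // longest sequence accepted
};

// Read a numeric environment-variable override if present, else keep the default.
static double env_or(const char* name, double fallback) {
    const char* v = std::getenv(name);
    return v ? std::stod(v) : fallback;
}

int main() {
    RdmaKvConfig cfg;
    // Hypothetical variable names, standing in for the env-var tuning described above.
    cfg.gpu_mem_util = env_or("GPU_MEMORY_UTILIZATION", cfg.gpu_mem_util);
    cfg.max_seq_len  = static_cast<int>(env_or("MAX_MODEL_LEN", cfg.max_seq_len));
    std::printf("kv_buffer=%zu rdma_block=%zu queue_depth=%d gpu_mem_util=%.2f max_seq_len=%d\n",
                cfg.kv_buffer_bytes, cfg.rdma_block_bytes, cfg.queue_depth,
                cfg.gpu_mem_util, cfg.max_seq_len);
    return 0;
}
```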

Section 06

Hardware & Software Requirements

  • Hardware: NVIDIA A100 GPU, Mellanox ConnectX InfiniBand NIC.
  • Software: Ubuntu 24.04, CUDA 12.6, Python 3.10+, CMake 3.18+.
  • Testing Environment: Developed and tested on Clemson CloudLab cluster.

Section 07

Application Scenarios & Project Value

  • Suitable Scenarios: High-concurrency online services (prefill nodes absorb large numbers of incoming requests), long-sequence inference (decode capacity scales independently), and heterogeneous GPU environments (different GPU types for prefill and decode).
  • Value: Provides an open-source reference implementation of the separated inference architecture; its GPUDirect RDMA integration is valuable to researchers and engineers pursuing extreme inference performance.