RDMA KV Cache: A Separated LLM Inference Acceleration Scheme Based on GPUDirect

The rdma-kv-cache project implements a separated LLM inference architecture, using GPUDirect RDMA technology to achieve zero-copy KV Cache transfer between GPUs, significantly reducing large model inference latency.

Tags: LLM Inference · RDMA · GPUDirect · vLLM · Separated Architecture · GPU Optimization · High-Performance Computing
Published 2026-04-27 06:40 · Recent activity 2026-04-27 07:20 · Estimated read 5 min

Section 01

RDMA KV Cache: A Separated LLM Inference Acceleration Scheme Using GPUDirect

The rdma-kv-cache project implements a separated LLM inference architecture, using GPUDirect RDMA technology to achieve zero-copy KV Cache transfer between GPUs and significantly reduce large-model inference latency.


Section 02

Background: Challenges of Traditional LLM Inference & Separated Architecture

As large language models continue to grow in scale, the traditional monolithic inference architecture suffers from resource imbalance: the prefill stage is compute-intensive but short, while the decode stage is memory-bound with low compute density. A separated (disaggregated) inference architecture decouples the two stages onto dedicated GPU clusters and transfers the KV Cache over a high-speed interconnect, so each cluster's resources are utilized more efficiently.


Section 03

Core Architecture & Key Components

The system consists of four core components:

  • Orchestrator: The request entry point; routes requests to prefill nodes and manages node registration and load balancing.
  • Prefill Nodes: Run a customized vLLM, process input prompts to generate the initial KV Cache, and send it to decode nodes via RDMA.
  • Decode Nodes: Run a customized vLLM, receive the KV Cache via RDMA, and perform autoregressive token generation.
  • GPUDirect RDMA: The technical foundation; the Mellanox/NVIDIA nvidia-peermem kernel module lets the NIC access GPU memory directly, bypassing the CPU and reaching microsecond-level transfer latency (a minimal registration sketch follows this list).
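To make the GPUDirect RDMA path concrete, here is a minimal, self-contained sketch (not code from the project) showing how a CUDA device buffer can be registered with an InfiniBand NIC once nvidia-peermem is loaded; device selection, queue-pair setup, and error handling are reduced to the bare minimum, and the buffer size and access flags are illustrative assumptions.

```cpp
// Minimal GPUDirect RDMA registration sketch (illustrative, not project code).
// With the nvidia-peermem kernel module loaded, ibv_reg_mr() accepts a CUDA
// device pointer, so the NIC can DMA to/from GPU memory without host staging.
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Allocate a KV-cache-sized buffer directly in GPU memory (size is arbitrary here).
    const size_t kv_bytes = 64ull << 20;  // 64 MiB
    void* gpu_buf = nullptr;
    if (cudaMalloc(&gpu_buf, kv_bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Open the first RDMA device and create a protection domain.
    int num_devices = 0;
    ibv_device** devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) { std::fprintf(stderr, "no RDMA device found\n"); return 1; }
    ibv_context* ctx = ibv_open_device(devs[0]);
    ibv_pd* pd = ibv_alloc_pd(ctx);

    // Register the GPU buffer with the NIC: no staging copy into host memory is involved.
    ibv_mr* mr = ibv_reg_mr(pd, gpu_buf, kv_bytes,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_WRITE |
                            IBV_ACCESS_REMOTE_READ);
    if (!mr) { std::fprintf(stderr, "ibv_reg_mr on GPU memory failed\n"); return 1; }
    std::printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    // Cleanup (a real engine would keep the registration alive for the server's lifetime).
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_buf);
    return 0;
}
```

If registering GPU memory fails here while registering ordinary host memory succeeds, the usual culprit is a missing or unloaded nvidia-peermem module, which is presumably the kind of condition the project's verify_gpudirect.sh script checks for.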

Section 04

Technical Implementation Details

  • vLLM Integration: An RDMAConnector module bridges vLLM to a C++ RDMA engine (exposed via pybind11 bindings), retaining vLLM's scheduling optimizations (see the binding sketch after this list).
  • Zero-Copy Transfer: Eliminates the intermediate copies of the conventional path (GPU → host memory → network → host memory → GPU) by registering vLLM's memory regions directly and writing them into the decode nodes' GPU buffers.
  • Deployment Tools: Includes an installation script (install_all_nodes.sh), a deployment script (smart_deploy.sh), run/stop scripts (smart_run.sh, smart_stop.sh), and a verification script (verify_gpudirect.sh).
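As an illustration of how a C++ engine might be surfaced to Python, the hedged sketch below exposes a native class through pybind11. The module name, class name, and method signatures are assumptions made for the example; the project's actual RDMAConnector and engine API may look different.

```cpp
// Illustrative pybind11 binding for a native RDMA engine (not the project's real API).
// The RDMA calls themselves are stubbed out; only the Python-facing surface is shown.
#include <pybind11/pybind11.h>
#include <cstddef>
#include <cstdint>
#include <string>

namespace py = pybind11;

// Hypothetical thin wrapper around the native RDMA send path.
class RdmaEngine {
public:
    RdmaEngine(const std::string& device, int port) : device_(device), port_(port) {}

    // Register an existing GPU allocation (e.g. a KV Cache block pool) by raw
    // device pointer and size; returns an opaque region id.
    int register_region(std::uintptr_t gpu_ptr, std::size_t bytes) {
        // A real implementation would call ibv_reg_mr() on the GPU pointer here.
        (void)gpu_ptr; (void)bytes;
        return next_region_id_++;
    }

    // Post an RDMA write of one KV block to the remote decode node's buffer.
    void send_kv_block(int region_id, std::size_t offset, std::size_t bytes) {
        // A real implementation would build an IBV_WR_RDMA_WRITE work request here.
        (void)region_id; (void)offset; (void)bytes;
    }

private:
    std::string device_;
    int port_;
    int next_region_id_ = 0;
};

PYBIND11_MODULE(rdma_engine, m) {
    m.doc() = "Illustrative RDMA engine bindings";
    py::class_<RdmaEngine>(m, "RdmaEngine")
        .def(py::init<const std::string&, int>(), py::arg("device"), py::arg("port"))
        .def("register_region", &RdmaEngine::register_region,
             py::arg("gpu_ptr"), py::arg("bytes"))
        .def("send_kv_block", &RdmaEngine::send_kv_block,
             py::arg("region_id"), py::arg("offset"), py::arg("bytes"));
}
```

On the Python side, a connector could then hand vLLM's KV Cache pointers straight to register_region and stream blocks with send_kv_block, which is the general shape of integration the RDMAConnector description suggests.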

Section 05

Performance Metrics & Optimization Directions

  • Performance Indicators: The prefill stage takes 50–200 ms depending on prompt length, while RDMA transfer latency is at the microsecond level; benchmark tools such as benchmark_client are provided.
  • Optimization: Key parameters (KV buffer size, RDMA block size, queue depth) are adjusted via a YAML config, while model settings (GPU memory utilization, max sequence length) can be tuned through environment variables (an illustrative sketch of these tunables follows this list).
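For a sense of what these tunables look like in code, the small sketch below collects them into a single struct with environment-variable overrides. The parameter names, defaults, and variable names are assumptions for illustration only, not the project's actual configuration keys.

```cpp
// Illustrative configuration sketch (names and defaults are assumptions, not the
// project's real YAML keys or environment variables).
#include <cstdio>
#include <cstdlib>
#include <string>

struct RdmaKvConfig {
    size_t kv_buffer_bytes  = 256ull << 20;  // total RDMA-registered KV buffer
    size_t rdma_block_bytes = 1ull << 20;    // granularity of each RDMA write
    int    queue_depth      = 128;           // outstanding RDMA work requests
    double gpu_mem_util     = 0.90;          // fraction of GPU memory vLLM may use
    int    max_seq_len      = 4096;          // longest sequence accepted
};

// Read a numeric environment-variable override if present, else keep the default.
static double env_or(const char* name, double fallback) {
    const char* v = std::getenv(name);
    return v ? std::stod(v) : fallback;
}

int main() {
    RdmaKvConfig cfg;
    // Hypothetical variable names, standing in for the env-var tuning described above.
    cfg.gpu_mem_util = env_or("GPU_MEMORY_UTILIZATION", cfg.gpu_mem_util);
    cfg.max_seq_len  = static_cast<int>(env_or("MAX_MODEL_LEN", cfg.max_seq_len));
    std::printf("kv_buffer=%zu rdma_block=%zu queue_depth=%d gpu_mem_util=%.2f max_seq_len=%d\n",
                cfg.kv_buffer_bytes, cfg.rdma_block_bytes, cfg.queue_depth,
                cfg.gpu_mem_util, cfg.max_seq_len);
    return 0;
}
```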

Section 06

Hardware & Software Requirements

  • Hardware: NVIDIA A100 GPU, Mellanox ConnectX InfiniBand NIC.
  • Software: Ubuntu 24.04, CUDA 12.6, Python 3.10+, CMake 3.18+.
  • Testing Environment: Developed and tested on Clemson CloudLab cluster.

Section 07

Application Scenarios & Project Value

  • Suitable Scenarios: High-concurrency online services (prefill nodes absorb large numbers of incoming requests), long-sequence inference (decode capacity scales independently), and heterogeneous GPU environments (different GPU types for prefill and decode).
  • Value: Provides an open-source reference implementation of the separated inference architecture; its GPUDirect RDMA integration is valuable to researchers and engineers pursuing extreme inference performance.