Chapter 01
RDMA KV Cache: A Disaggregated LLM Inference Acceleration Scheme Using GPUDirect RDMA
The rdma-kv-cache project implements a disaggregated LLM inference architecture. It uses GPUDirect RDMA to transfer the KV cache between GPUs with zero copies, significantly reducing inference latency for large models.
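To make the architecture concrete, here is a minimal, hardware-free sketch of the disaggregated flow the description implies: a prefill worker builds the KV cache, hands it to a decode worker, and decoding continues from that cache. Everything below is a hypothetical illustration in plain Python; the `transport` dict, function names, and token shapes are stand-ins for the project's actual GPUDirect RDMA transfer path, which would move GPU memory directly over the NIC instead.

```python
def prefill(prompt_tokens):
    """Prefill stage: pretend to compute one KV pair per prompt token."""
    return [(f"k{t}", f"v{t}") for t in prompt_tokens]

def transfer_kv(kv_cache, transport):
    """Stand-in for the zero-copy RDMA write of the KV cache.

    With GPUDirect RDMA the NIC reads the prefill GPU's memory and writes
    it directly into the decode GPU's memory, bypassing host RAM. Here we
    just pass the reference through to model the copy-free handoff.
    """
    transport["kv"] = kv_cache  # no copy: same object on both "nodes"

def decode(transport, steps):
    """Decode stage: consume the transferred cache and emit new tokens."""
    kv_cache = transport["kv"]
    return [f"tok{len(kv_cache) + i}" for i in range(steps)]

transport = {}
cache = prefill([0, 1, 2])
transfer_kv(cache, transport)
print(decode(transport, steps=2))   # tokens produced after the handoff
print(transport["kv"] is cache)     # True: no copy was made
```

The point of the sketch is the separation of stages: prefill and decode could run on different GPUs (or machines), with only the KV cache crossing the boundary. In the real system that crossing is the expensive step, which is why a zero-copy GPU-to-GPU path matters.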