# RDMA KV Cache: A Separated LLM Inference Acceleration Scheme Based on GPUDirect

> The rdma-kv-cache project implements a separated LLM inference architecture, using GPUDirect RDMA technology to achieve zero-copy KV Cache transfer between GPUs, significantly reducing large model inference latency.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-26T22:40:41.000Z
- Last activity: 2026-04-26T23:20:33.162Z
- Popularity: 148.3
- Keywords: LLM inference, RDMA, GPUDirect, vLLM, disaggregated architecture, GPU optimization, high-performance computing
- Page link: https://www.zingnex.cn/en/forum/thread/rdma-kv-cache-gpudirect-rdma
- Canonical: https://www.zingnex.cn/forum/thread/rdma-kv-cache-gpudirect-rdma
- Markdown source: floors_fallback

---

## Background: Challenges of Traditional LLM Inference & Separated Architecture

As large language models continue to grow, the traditional monolithic inference architecture suffers from resource imbalance: the prefill stage is compute-intensive but short-lived, while the decode stage is memory-bound with low compute density. A separated (disaggregated) inference architecture decouples the two stages onto dedicated GPU clusters and transfers the KV cache between them over a high-speed interconnect, allowing each stage's resources to be utilized more efficiently.
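
To make decode's memory pressure concrete, here is a back-of-the-envelope KV cache sizing for a single request; the model dimensions are illustrative (roughly 7B-class), not taken from the project:

```python
# KV cache size per request:
#   2 (K and V) * layers * seq_len * kv_heads * head_dim * dtype_bytes
# The model dimensions below are illustrative, not from the project.
layers, kv_heads, head_dim = 32, 32, 128
seq_len, dtype_bytes = 4096, 2  # fp16/bf16

kv_bytes = 2 * layers * seq_len * kv_heads * head_dim * dtype_bytes
print(f"KV cache per request: {kv_bytes / 2**30:.2f} GiB")  # -> 2.00 GiB
```

At this scale, moving the prefill-produced cache to a decode GPU quickly and without CPU staging copies is exactly what GPUDirect RDMA targets.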

## Core Architecture & Key Components

The system consists of four core components (a request-flow sketch follows the list):
- **Orchestrator**: Request entry, routes requests to prefill nodes, manages node registration and load balancing.
- **Prefill Nodes**: Run custom vLLM, handle input prompts to generate initial KV Cache, and send it via RDMA.
- **Decode Nodes**: Run custom vLLM, receive KV Cache via RDMA and perform autoregressive token generation.
- **GPUDirect RDMA**: The technical foundation; the nvidia-peermem kernel module (successor to Mellanox's nv_peer_mem) lets the NIC read and write GPU memory directly, bypassing the CPU with microsecond-level latency.
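
The following is a hypothetical sketch of that request flow; every name is illustrative and the prefill/decode RPCs are stubbed, since the project's actual API is not documented in this post:

```python
# Hypothetical request flow through the orchestrator; all names are
# illustrative stand-ins, not the project's actual API.
import itertools

class Orchestrator:
    def __init__(self, prefill_nodes, decode_nodes):
        # Simple round-robin load balancing over registered node addresses.
        self._prefill = itertools.cycle(prefill_nodes)
        self._decode = itertools.cycle(decode_nodes)

    def handle(self, prompt: str) -> str:
        prefill, decode = next(self._prefill), next(self._decode)
        # 1) The prefill node computes the initial KV cache and pushes it over
        #    RDMA straight into `decode`'s GPU memory; only a small metadata
        #    handle comes back to the orchestrator.
        kv_handle = self._prefill_rpc(prefill, prompt, target=decode)
        # 2) The decode node resumes autoregressive generation from that cache.
        return self._decode_rpc(decode, kv_handle)

    # Stubs standing in for RPCs to the custom vLLM instances.
    def _prefill_rpc(self, node, prompt, target):
        return {"target": target, "tokens": len(prompt.split())}

    def _decode_rpc(self, node, kv_handle):
        return f"<generated on {node} from a {kv_handle['tokens']}-token cache>"

orch = Orchestrator(["prefill-0"], ["decode-0", "decode-1"])
print(orch.handle("Explain GPUDirect RDMA in one sentence."))
```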

## Technical Implementation Details

- **vLLM Integration**: An RDMAConnector module backed by a C++ RDMA engine (exposed to Python via pybind11 bindings) plugs into the custom vLLM, so vLLM's scheduling optimizations are retained.
- **Zero-Copy Transfer**: Eliminates the intermediate copies of the conventional path (GPU → host → network → host → GPU) by registering vLLM's GPU memory regions with the NIC and writing directly into the decode nodes' GPU buffers (see the sketch after this list).
- **Deployment Tools**: Includes install scripts (`install_all_nodes.sh`), deployment scripts (`smart_deploy.sh`), run/stop scripts (`smart_run.sh`, `smart_stop.sh`), and verification script (`verify_gpudirect.sh`).
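
On the prefill side, the zero-copy path might look like the following; the `rdma_engine` module and its functions are hypothetical stand-ins for the pybind11 bindings, and only the overall flow (register GPU memory, then RDMA-write it to the peer) reflects what the project describes:

```python
# Hypothetical prefill-side use of the pybind11-exposed RDMA engine.
# `rdma_engine`, `register_memory`, and `rdma_write` are illustrative names.
import torch
import rdma_engine  # hypothetical pybind11 module wrapping the C++ engine

def send_kv_cache(kv_block: torch.Tensor, peer) -> None:
    assert kv_block.is_cuda  # zero-copy requires GPU-resident memory
    # Register the GPU region with the NIC. With nvidia-peermem loaded, the
    # NIC can DMA this memory directly; no staging copy through host RAM.
    nbytes = kv_block.numel() * kv_block.element_size()
    mr = rdma_engine.register_memory(kv_block.data_ptr(), nbytes)
    # One-sided RDMA write into the decode node's pre-registered GPU buffer;
    # the remote address and rkey were exchanged when the peers connected.
    rdma_engine.rdma_write(mr, remote_addr=peer.addr, rkey=peer.rkey)
```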

## Performance Metrics & Optimization Directions

- **Performance Indicators**: The prefill stage takes 50-200 ms depending on prompt length, and RDMA transfer latency is on the microsecond scale; a `benchmark_client` tool is provided for measurement.
- **Optimization**: Key transfer parameters (KV buffer size, RDMA block size, queue depth) are adjusted via a YAML config, while model settings (GPU memory utilization, max sequence length) are tuned via environment variables; a sketch of such a config follows.
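
The knob names below are assumptions about the project's configuration schema, made only to illustrate the tuning surface the post describes:

```python
# Illustrative tuning surface; key and variable names are assumptions,
# not the project's documented schema.
import os
import yaml  # PyYAML

CONFIG = yaml.safe_load("""
rdma:
  kv_buffer_size_mb: 4096  # GPU buffer reserved for in-flight KV blocks
  block_size_kb: 256       # bytes per RDMA write; larger = fewer work requests
  queue_depth: 128         # outstanding work requests before the sender stalls
""")

# Model-level settings via environment variables (names illustrative).
gpu_mem_util = float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.90"))
max_model_len = int(os.environ.get("MAX_MODEL_LEN", "8192"))

print(CONFIG["rdma"], gpu_mem_util, max_model_len)
```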

## Hardware & Software Requirements

- **Hardware**: NVIDIA A100 GPUs and Mellanox ConnectX InfiniBand NICs.
- **Software**: Ubuntu 24.04, CUDA 12.6, Python 3.10+, CMake 3.18+.
- **Testing Environment**: Developed and tested on a Clemson CloudLab cluster (a minimal prerequisite check is sketched below).
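
In the spirit of the project's `verify_gpudirect.sh`, a quick prerequisite check could look like this; it is an independent sketch, not the project's actual script:

```python
# Independent prerequisite-check sketch (not the project's verify script):
# confirms the NVIDIA driver, RDMA userspace tools, and the nvidia_peermem
# kernel module (required for GPUDirect RDMA) are present.
import shutil
import subprocess

def module_loaded(name: str) -> bool:
    """Return True if the kernel module appears in lsmod output."""
    out = subprocess.run(["lsmod"], capture_output=True, text=True).stdout
    return any(line.split()[0] == name
               for line in out.splitlines()[1:] if line.strip())

checks = {
    "NVIDIA driver (nvidia-smi)": shutil.which("nvidia-smi") is not None,
    "RDMA tools (ibv_devinfo)": shutil.which("ibv_devinfo") is not None,
    "GPUDirect module (nvidia_peermem)": module_loaded("nvidia_peermem"),
}
for name, ok in checks.items():
    print(f"[{'OK' if ok else 'MISSING'}] {name}")
```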

## Application Scenarios & Project Value

- **Suitable Scenarios**: High-concurrency online serving (prefill nodes absorb bursts of new requests), long-sequence inference (decode capacity scales independently), and heterogeneous GPU environments (different GPU types for prefill and decode).
- **Value**: Provides an open-source reference implementation of a separated inference architecture; its GPUDirect RDMA integration is a useful reference for researchers and engineers pursuing peak inference performance.
