kvcache-sim: A Multi-Tier KV Cache Simulation System for Large Model Inference

A KV cache simulator supporting an HBM/DRAM/SSD three-tier storage architecture. It offers three simulation modes (single-node, 10k-card cluster, and PD separation) and six built-in eviction strategies, including LRU, ARC, and Learned, and can be used to evaluate the cache efficiency and scalability of LLM inference systems.

Tags: KV cache, LLM inference, cache simulation, Prefill-Decode separation, multi-tier storage, eviction policies, GPU cluster, CXL memory
Published 2026-04-29 12:45 · Recent activity 2026-04-29 12:49 · Estimated read 6 min

Section 01

Introduction: kvcache-sim, a Multi-Tier KV Cache Simulation System for Large Model Inference

kvcache-sim is a KV cache simulator built around a three-tier HBM/DRAM/SSD storage hierarchy. It provides three simulation modes (single-node, 10k-card cluster, and Prefill-Decode separation) and ships six eviction strategies, including LRU, ARC, and Learned, making it a practical tool for evaluating the cache efficiency and scalability of LLM inference systems.


Section 02

Project Background: Core Challenges of KV Cache Management in LLM Inference

In LLM inference services, the KV cache is a key technique for improving generation efficiency. As model sizes and context lengths grow, KV cache storage demand rises rapidly: a single request's KV cache can reach several GB, or even tens of GB, when a 70B model processes an 8K context. kvcache-sim was developed to study these trade-offs; it supports single-machine multi-tier storage, 10k-card cluster, and PD-separation deployment modes, giving researchers and engineers a comprehensive cache-strategy evaluation tool.
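
The storage math is easy to check. Here is a back-of-the-envelope sketch in Python, assuming a Llama-2-70B-style configuration (80 layers, 8 KV heads under GQA, head dimension 128, fp16); these model parameters are illustrative assumptions, not values taken from kvcache-sim:

```python
# Back-of-the-envelope KV cache sizing (illustrative parameters, not from kvcache-sim).
# Per token, every layer stores one K and one V vector for each KV head.
n_layers, n_kv_heads, head_dim = 80, 8, 128  # Llama-2-70B-style config with GQA
bytes_per_elem = 2                           # fp16
seq_len = 8192                               # 8K context

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = bytes_per_token * seq_len / 2**30
print(f"{bytes_per_token / 1024:.0f} KiB/token, {total_gib:.2f} GiB per 8K request")
# -> 320 KiB/token, 2.50 GiB per 8K request
```

Under full multi-head attention (64 KV heads rather than 8), the same request would need eight times as much, roughly 20 GiB, which is where the "tens of GB" upper end comes from.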


Section 03

System Architecture: Three Simulation Modes Covering All Scenarios

Single-Node Mode

Simulates a single GPU server with 4 parallel workers and an HBM→DRAM→SSD storage hierarchy, and includes six eviction strategies: LRU, ARC, SessionPrefetch, SelectiveWrite, Learned, and Belady Oracle.
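
Conceptually, each tier pairs a capacity budget with a pluggable eviction policy, and an eviction from one tier becomes an insertion into the next. Below is a minimal sketch of that cascade with LRU as the policy; the class and function names are hypothetical, not kvcache-sim's actual API:

```python
from collections import OrderedDict

class LRUTier:
    """One storage tier with LRU eviction (hypothetical names, not kvcache-sim's API)."""

    def __init__(self, name: str, capacity_blocks: int):
        self.name = name
        self.capacity = capacity_blocks
        self._blocks: OrderedDict[int, None] = OrderedDict()

    def access(self, block_id: int) -> None:
        # Refresh recency on a hit, or insert on a miss.
        self._blocks.pop(block_id, None)
        self._blocks[block_id] = None

    def evict_if_full(self) -> int | None:
        # Return the least-recently-used block so the caller can demote it.
        if len(self._blocks) > self.capacity:
            block_id, _ = self._blocks.popitem(last=False)
            return block_id
        return None

def insert(tiers: list[LRUTier], block_id: int) -> None:
    """Insert into the top tier; cascade evictions down HBM -> DRAM -> SSD."""
    for tier in tiers:
        tier.access(block_id)
        block_id = tier.evict_if_full()
        if block_id is None:
            break
```

Swapping the policy (ARC, Learned, Belady Oracle) only changes what `evict_if_full` returns, which is what makes the strategies pluggable.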

10k-Card Cluster Mode

Simulates large-scale deployment of 10,240 GPUs (160 racks × 64 GPUs/rack), introduces an EIC shared memory pool, enables intra-rack cache sharing via CXL/RDMA, models latencies for intra-rack (3 μs), cross-rack (15 μs), and SSD access (200 μs), and uses a prefix-aware routing strategy to improve the hit rate.
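
Those two ingredients, a per-location latency table and prefix-aware placement, are easy to express directly. A minimal sketch using the latency figures quoted above; the function names and rack bookkeeping are hypothetical:

```python
# Latency to fetch a KV block, by where it is found (figures from the text).
LATENCY_US = {"intra_rack": 3.0, "cross_rack": 15.0, "ssd": 200.0}

def fetch_cost_us(location: str, n_blocks: int) -> float:
    """Cost of pulling n_blocks KV blocks from a given location."""
    return LATENCY_US[location] * n_blocks

def route_request(prefix_blocks: list[int], racks: dict[int, set[int]]) -> int:
    """Prefix-aware routing: pick the rack whose EIC pool caches the longest
    leading run of this request's prefix blocks, maximizing intra-rack hits."""
    def cached_prefix_len(cached: set[int]) -> int:
        n = 0
        for block in prefix_blocks:
            if block not in cached:
                break
            n += 1
        return n
    return max(racks, key=lambda rack_id: cached_prefix_len(racks[rack_id]))
```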

PD Separation Mode

Implements a Prefill-Decode decoupled architecture: the PrefillNode is equipped with a RadixTree prefix cache, the DecodeNode receives KV cache transmitted via RDMA using push/pull/pull_on_demand transmission strategies, and dual routing layers optimize load balancing and latency.
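
The three transmission strategies differ mainly in when blocks cross the RDMA link, and hence in how much transfer time lands on the decode critical path. A minimal sketch of that trade-off; the enum and the linear per-block cost model are illustrative assumptions, not kvcache-sim's API:

```python
from enum import Enum

class TransferMode(Enum):
    PUSH = "push"                      # prefill pushes each block as it is produced
    PULL = "pull"                      # decode pulls the full cache before its first step
    PULL_ON_DEMAND = "pull_on_demand"  # decode fetches blocks lazily as attention needs them

def decode_start_delay_ms(mode: TransferMode, n_blocks: int,
                          per_block_ms: float = 6.7) -> float:
    """Extra transfer delay before the first decode step (illustrative linear model;
    the 6.7 ms default is the first-block figure quoted in the next section)."""
    if mode is TransferMode.PULL:
        return per_block_ms * n_blocks  # the whole cache must land before decoding
    return per_block_ms                 # push / on-demand: only the last or first
                                        # needed block gates the step
```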


Section 04

Performance Validation: Metrics, Workloads, and Calibration

Key Performance Metrics

Taking an H100 running a 70B model as an example: prefill takes about 0.35 ms/token, decode about 83.6 ms/token, and a 64-sequence batched decode step about 93.6 ms; transmitting the first KV block takes about 6.7 ms, and a full 8K prompt about 215 ms.
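
These figures compose into useful end-to-end estimates. A quick sanity check using only the numbers above; the request shape (8K prompt, 256 output tokens) is an assumption:

```python
# End-to-end estimate for one request, from the calibration figures above.
prompt_tokens, output_tokens = 8192, 256   # assumed request shape
prefill_ms = 0.35 * prompt_tokens          # ~2867 ms of prefill compute
kv_transfer_ms = 215                       # full 8K KV cache over RDMA
decode_ms = 83.6 * output_tokens           # ~21402 ms of unbatched decode

print(f"prefill {prefill_ms:.0f} ms + transfer {kv_transfer_ms} ms "
      f"+ decode {decode_ms:.0f} ms")
# Batching dominates decode economics: at batch 64 a step costs ~93.6 ms but
# advances 64 sequences, i.e. ~1.46 ms/token amortized vs 83.6 ms/token alone.
```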

Real-Workload Support

Compatible with production-level trace datasets like BurstGPT, Azure LLM Inference Trace, Mooncake Traces, and SplitwiseSim, with automatic format conversion.
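
Format conversion usually means mapping each trace's columns onto one internal request record. A minimal sketch for a BurstGPT-style CSV; the column names and the Request schema here are assumptions, so adjust them to the actual trace files:

```python
import csv
from dataclasses import dataclass

@dataclass
class Request:
    """Common internal request record (hypothetical schema)."""
    arrival_s: float
    prompt_tokens: int
    output_tokens: int

def load_burstgpt(path: str) -> list[Request]:
    # Assumed BurstGPT-style CSV columns; other traces would get their own
    # loaders that normalize into the same Request records.
    with open(path, newline="") as f:
        return [Request(float(row["Timestamp"]),
                        int(row["Request tokens"]),
                        int(row["Response tokens"]))
                for row in csv.DictReader(f)]
```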

Calibration and Integration

Supports injecting calibration parameters via overlay files (e.g., H100_70b_reference.yaml), and can integrate with external simulators like Vidur, Accel-Sim, and Ramulator2.
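
Overlay injection can be as simple as recursively merging the calibration file onto the simulator's defaults. A minimal sketch, assuming YAML overlays and PyYAML; the default values and merge helper are hypothetical, and only the overlay file name comes from the text:

```python
import yaml  # PyYAML

defaults = {"prefill_ms_per_token": 0.5, "decode_ms_per_token": 100.0}  # placeholder baseline

def deep_merge(base: dict, overlay: dict) -> dict:
    """Overlay calibrated values onto defaults, recursing into nested sections."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

with open("H100_70b_reference.yaml") as f:
    config = deep_merge(defaults, yaml.safe_load(f))
```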


Section 05

Typical Application Scenarios: Aiding Key Design Decisions

  1. P:D Ratio Selection: Find the optimal Prefill/Decode GPU ratio (see the sketch after this list)
  2. Prefix Cache Capacity Planning: Determine the optimal ratio between KV cache and model weights
  3. Interconnect Bandwidth Evaluation: Compare the impact of different RDMA configurations on transmission overhead
  4. Eviction Strategy Selection: Choose the appropriate strategy based on workload
  5. EIC Capacity Planning: Configure per-rack shared CXL memory
  6. Impact of Context Length: Evaluate the effect of 4K/32K/128K contexts on cache
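
For scenario 1, the experiment shape is a sweep: fix the GPU budget, vary the split, and report the bottleneck throughput. A minimal analytical sketch using the Section 04 calibration figures; the GPU budget and request shape are assumptions:

```python
# P:D ratio sweep: pick the split that balances prefill and decode capacity.
TOTAL_GPUS = 16
PREFILL_MS = 0.35 * 8192        # prefill compute per request (8K prompt)
DECODE_MS = 93.6 / 64 * 256     # amortized decode per request (batch 64, 256 tokens)

for p in range(1, TOTAL_GPUS):
    d = TOTAL_GPUS - p
    prefill_rps = p * 1000 / PREFILL_MS   # requests/s the prefill pool sustains
    decode_rps = d * 1000 / DECODE_MS     # requests/s the decode pool sustains
    print(f"P:D = {p:2d}:{d:<2d}  throughput ~ {min(prefill_rps, decode_rps):.2f} req/s")

# With these numbers prefill (~2.87 s/request) far outweighs amortized decode
# (~0.37 s/request), so the optimum lands near P:D = 14:2 for 16 GPUs.
```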

Section 06

Technical Highlights: Modular Design and Efficient Implementation

  • Clear code structure: sim/ (core logic), trace/ (trace processing), learned/ (machine learning strategies), experiments/ (experiment scripts)
  • RadixTree prefix cache: Uses reference counting for block sharing and recycling, conceptually consistent with production systems such as vLLM (sketched after this list)
  • Pluggable strategies: Supports combinations of six eviction strategies and is easy to extend
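
A minimal sketch of the reference-counting idea behind the RadixTree bullet above; the node structure is conceptual, and kvcache-sim's actual implementation may differ:

```python
class RadixNode:
    """Prefix-tree node over token-block IDs with reference counting
    (a conceptual sketch, not kvcache-sim's actual structure)."""

    def __init__(self, block_id: int | None = None):
        self.block_id = block_id        # KV block this node pins, None at the root
        self.children: dict[int, "RadixNode"] = {}
        self.ref_count = 0              # live requests sharing this prefix

    def insert(self, blocks: list[int]) -> "RadixNode":
        """Walk/extend the path for a request's prefix, bumping refcounts."""
        node = self
        for block in blocks:
            node = node.children.setdefault(block, RadixNode(block))
            node.ref_count += 1         # shared prefixes are stored once, pinned many times
        return node

    def release(self, blocks: list[int]) -> None:
        """Drop a request's references; blocks at ref_count 0 become evictable."""
        node = self
        for block in blocks:
            node = node.children[block]
            node.ref_count -= 1
```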

Section 07

Summary and Outlook: A Fully-Featured KV Cache Simulation Platform

kvcache-sim covers the major directions in today's LLM serving landscape: single-node optimization, 10k-card cluster scaling, and PD-separation architecture modeling. Released under the MIT license, it offers researchers and engineers an open-source tool with a clear code structure and complete documentation, well suited to secondary development and experiment reproduction.