# kvcache-sim: A Multi-Tier KV Cache Simulation System for Large Model Inference

> A KV cache simulator with an HBM/DRAM/SSD three-tier storage architecture, offering three simulation modes (single-node, 10k-card cluster, and PD separation) and six built-in eviction strategies, including LRU, ARC, and a learned policy. It can be used to evaluate the cache efficiency and scalability of LLM inference systems.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-29T04:45:10.000Z
- Last activity: 2026-04-29T04:49:25.680Z
- Heat: 150.9
- Keywords: KV cache, LLM inference, cache simulation, Prefill-Decode separation, multi-tier storage, eviction policy, GPU cluster, CXL memory
- Page link: https://www.zingnex.cn/en/forum/thread/kvcache-sim-kv

---


## Project Background: Core Challenges of KV Cache Management in LLM Inference

In LLM inference services, the KV cache is a key technique for generation efficiency. As model sizes and context lengths grow, KV cache storage demand grows rapidly: a single request's KV cache can reach several gigabytes, or even tens of gigabytes, when a 70B model processes an 8K context. kvcache-sim was developed to support single-machine multi-tier storage, 10k-card cluster, and PD separation deployment modes, giving researchers and engineers a comprehensive tool for evaluating cache strategies.
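
For intuition, the per-request footprint follows directly from the model shape. A minimal sketch of the sizing arithmetic, assuming a typical 70B-class transformer (80 layers, 128-dim heads, fp16); these shape values are illustrative assumptions, not figures from kvcache-sim:

```python
# Back-of-the-envelope KV cache sizing. Shape values below are assumed
# for a typical 70B-class transformer; only the 70B/8K scenario itself
# comes from the text above.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=64, head_dim=128,
                   bytes_per_elem=2):
    # 2x accounts for the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

print(kv_cache_bytes(8192) / 2**30)                 # ~20 GiB with full MHA
print(kv_cache_bytes(8192, n_kv_heads=8) / 2**30)   # ~2.5 GiB with GQA
```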

## System Architecture: Three Simulation Modes Covering All Scenarios

### Single-Node Mode
Simulates a single GPU server: 4 parallel workers, an HBM→DRAM→SSD storage hierarchy, and six eviction strategies (LRU, ARC, SessionPrefetch, SelectiveWrite, Learned, and Belady Oracle).
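
To make the hierarchy concrete, here is a minimal sketch of LRU eviction with demotion down the HBM→DRAM→SSD chain; class and method names are illustrative, not kvcache-sim's actual API:

```python
from collections import OrderedDict

class LRUTier:
    """One storage tier; evicted blocks are demoted to the tier below."""
    def __init__(self, name, capacity_blocks, lower=None):
        self.name, self.capacity, self.lower = name, capacity_blocks, lower
        self.blocks = OrderedDict()  # block_id -> payload, in LRU order

    def put(self, block_id, payload):
        self.blocks[block_id] = payload
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            victim, data = self.blocks.popitem(last=False)  # LRU victim
            if self.lower is not None:
                self.lower.put(victim, data)  # demote instead of dropping

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # refresh recency
            return self.blocks[block_id]
        return self.lower.get(block_id) if self.lower else None

ssd = LRUTier("ssd", 10_000)
dram = LRUTier("dram", 1_000, lower=ssd)
hbm = LRUTier("hbm", 100, lower=dram)
```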

### 10k-Card Cluster Mode
Simulates a large-scale deployment of 10,240 GPUs (160 racks × 64 GPUs per rack), introduces an EIC shared memory pool, enables intra-rack cache sharing via CXL/RDMA, models access latencies at fine granularity (intra-rack 3μs, cross-rack 15μs, SSD 200μs), and uses prefix-aware routing to improve hit rate.
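
The tiered latency model reduces to a small lookup. A sketch using the 3/15/200μs figures above; the location encoding is an assumption:

```python
INTRA_RACK_US, CROSS_RACK_US, SSD_US = 3.0, 15.0, 200.0

def access_latency_us(requester_rack, location):
    """location: ('eic', rack_id) for the CXL/RDMA shared pool,
    or ('ssd', rack_id) for flash."""
    tier, rack_id = location
    if tier == "ssd":
        return SSD_US
    return INTRA_RACK_US if rack_id == requester_rack else CROSS_RACK_US

print(access_latency_us(7, ("eic", 7)))    # 3.0   intra-rack hit
print(access_latency_us(7, ("eic", 42)))   # 15.0  cross-rack EIC
print(access_latency_us(7, ("ssd", 7)))    # 200.0 flash
```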

### PD Separation Mode
Implements a Prefill-Decode decoupled architecture: the PrefillNode carries a RadixTree prefix cache, the DecodeNode receives KV cache transmitted via RDMA, three transmission strategies (push/pull/pull_on_demand) are supported, and dual routing layers optimize load balancing and latency.
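
The three transmission strategies differ mainly in when KV blocks cross the wire. An illustrative dispatch, where the function shape and names are assumptions rather than kvcache-sim's API:

```python
def plan_kv_transfer(strategy: str, blocks: list[int]):
    """Return (blocks moved before decode starts, blocks fetched lazily)."""
    if strategy in ("push", "pull"):
        # push: the prefill node sends everything eagerly after prefill;
        # pull: the decode node fetches everything before its first step.
        # All blocks move up front either way; they differ in who initiates.
        return blocks, []
    if strategy == "pull_on_demand":
        # Only the first block is needed to start decoding; the rest are
        # requested as decode reaches them, overlapping transfer with compute.
        return blocks[:1], blocks[1:]
    raise ValueError(f"unknown strategy: {strategy}")

eager, lazy = plan_kv_transfer("pull_on_demand", list(range(32)))
print(len(eager), len(lazy))  # 1 31
```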

## Performance Validation: Metrics, Workloads, and Calibration

### Key Performance Metrics
Taking an H100 running a 70B model as an example: prefill costs about 0.35ms/token and decode about 83.6ms/token, with a 64-sequence batched decode step at about 93.6ms; transmitting the first KV block takes about 6.7ms, and a full 8K prompt about 215ms.
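
These transfer numbers are internally consistent under an assumed 256-token KV block size (the block size is not stated here): an 8K prompt splits into 32 blocks, and 32 × 6.7ms lands on the quoted ~215ms.

```python
BLOCK_TOKENS = 256       # assumed block size, not stated in the post
FIRST_BLOCK_MS = 6.7     # per-block transfer figure quoted above
blocks = 8192 // BLOCK_TOKENS
print(blocks, blocks * FIRST_BLOCK_MS)  # 32 214.4 -> matches the ~215ms quote
```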

### Real-Workload Support
Compatible with production-level trace datasets like BurstGPT, Azure LLM Inference Trace, Mooncake Traces, and SplitwiseSim, with automatic format conversion.
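
A common pattern for such conversion is to normalize every source format into one request record before simulation. A hedged sketch, where the field names and mapping mechanism are illustrative (the real converters live in trace/ and follow the upstream dataset schemas):

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival_s: float      # arrival time relative to trace start
    prompt_tokens: int    # prefill length
    output_tokens: int    # decode length
    session_id: str = ""  # for session/prefix-aware policies, if present

def normalize(rows, column_map):
    """column_map: per-dataset dict of source column -> Request field."""
    for row in rows:
        yield Request(**{field: row[col] for col, field in column_map.items()})

# e.g. reqs = normalize(csv_rows,
#                       {"ts": "arrival_s", "in": "prompt_tokens",
#                        "out": "output_tokens"})
```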

### Calibration and Integration
Supports injecting calibration parameters via overlay files (e.g., H100_70b_reference.yaml), and can integrate with external simulators like Vidur, Accel-Sim, and Ramulator2.
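
A minimal sketch of overlay-style calibration, assuming the overlay is a flat YAML file of timing parameters; the merge semantics and the parameter names in `base` are assumptions:

```python
import yaml  # PyYAML

def apply_overlay(base: dict, overlay_path: str) -> dict:
    with open(overlay_path) as f:
        overlay = yaml.safe_load(f) or {}
    merged = dict(base)
    merged.update(overlay)  # calibrated values override defaults
    return merged

base = {"prefill_ms_per_token": 0.5, "decode_ms_per_token": 100.0}
cfg = apply_overlay(base, "H100_70b_reference.yaml")
```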

## Typical Application Scenarios: Aiding Key Design Decisions

1. P:D Ratio Selection: Find the optimal Prefill/Decode GPU ratio (see the sweep sketch after this list)
2. Prefix Cache Capacity Planning: Determine the optimal ratio between KV cache and model weights
3. Interconnect Bandwidth Evaluation: Compare the impact of different RDMA configurations on transmission overhead
4. Eviction Strategy Selection: Choose the appropriate strategy based on workload
5. EIC Capacity Planning: Configure per-rack shared CXL memory
6. Impact of Context Length: Evaluate the effect of 4K/32K/128K contexts on cache behavior
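
For scenario 1, a sweep over P:D splits is the natural experiment shape. A hypothetical driver: `run_simulation`, its arguments, and the returned metrics dict are all assumptions for illustration, not kvcache-sim's actual API:

```python
def sweep_pd_ratio(run_simulation, total_gpus=16):
    """Try every Prefill/Decode split and keep the best tail latency."""
    results = {}
    for prefill_gpus in range(1, total_gpus):
        decode_gpus = total_gpus - prefill_gpus
        metrics = run_simulation(prefill_gpus=prefill_gpus,
                                 decode_gpus=decode_gpus)
        results[(prefill_gpus, decode_gpus)] = metrics["p99_latency_ms"]
    return min(results, key=results.get)  # (best_prefill, best_decode)
```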

## Technical Highlights: Modular Design and Efficient Implementation

- Clear code structure: sim/ (core logic), trace/ (trace processing), learned/ (machine learning strategies), experiments/ (experiment scripts)
- RadixTree prefix cache: Uses reference counting for block sharing and recycling, consistent with production systems such as vLLM (a minimal sketch follows this list)
- Pluggable strategies: Supports combinations of six eviction strategies and is easy to extend
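
A minimal sketch of the refcounting idea behind the RadixTree prefix cache mentioned above; the structure is illustrative, not kvcache-sim's actual implementation:

```python
class RadixNode:
    def __init__(self):
        self.children = {}    # token chunk (tuple of ids) -> RadixNode
        self.block_id = None  # KV block backing this prefix chunk
        self.refcount = 0     # live sequences sharing this prefix

def acquire(root, chunks):
    """Walk/extend the tree for a prompt split into token chunks,
    bumping refcounts so shared blocks cannot be evicted mid-use."""
    node, acquired = root, []
    for chunk in chunks:
        node = node.children.setdefault(chunk, RadixNode())
        node.refcount += 1
        acquired.append(node)
    return acquired

def release(nodes):
    """On sequence finish, drop refs; nodes at refcount 0 become
    reclaimable by whichever eviction strategy is active."""
    for node in nodes:
        node.refcount -= 1
```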

## Summary and Outlook: A Fully-Featured KV Cache Simulation Platform

kvcache-sim covers the main directions in current LLM serving work: single-node optimization, 10k-card cluster scaling, and PD-separation architecture modeling. Released under the MIT license, with a clear code structure and complete documentation, it gives researchers and engineers an open-source base for secondary development and experiment reproduction.
