# TierKV: A Cross-Node Distributed KV Cache System That Speeds Up LLM Long-Context Inference by 7x

> TierKV retains evicted KV caches across networks via a three-tier architecture (GPU Hot Tier, LAN Cold KV Tier, WiFi Cold SSM Tier), reducing the Time To First Token (TTFT) of long-context inference from 30 seconds to 4 seconds and providing a new approach for cost-effective expansion of LLM inference context length.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-03T07:43:08.000Z
- Last activity: 2026-05-03T07:49:57.018Z
- Heat: 159.9
- Keywords: LLM inference, KV cache, distributed systems, quantization compression, long context, EXO framework, gRPC, TurboQuant
- Page link: https://www.zingnex.cn/en/forum/thread/tierkv-kv-llm7
- Canonical: https://www.zingnex.cn/forum/thread/tierkv-kv-llm7

---

## Introduction

TierKV is a cross-node distributed KV cache system that targets the cold-start problem caused by KV cache eviction in LLM long-context inference. Instead of discarding evicted KV caches, it retains them across the network in a three-tier architecture (GPU Hot Tier, LAN Cold KV Tier, WiFi Cold SSM Tier). This cuts the TTFT of long-context inference from 30 seconds to 4 seconds (a 7x speedup) and offers a cost-effective way to extend LLM inference context length.

## Problem Background: KV Cache Bottleneck in LLM Long-Context Inference

Modern LLM inference faces a core tension: users expect ever longer contexts (tens of thousands to hundreds of thousands of tokens), but GPU memory is scarce. When the KV cache fills memory, old caches must be evicted, forcing full re-computation when the same prompt arrives again (a cold start). For example, the Qwen3.6-35B-A3B model requires about 70GB of BF16 KV cache for an 8000-token prompt, which easily exceeds the memory of a single GPU. Traditional quantization and paged-cache methods cannot fundamentally remove this constraint.

## Three-Tier Architecture Design: GPU Hot Tier + LAN Cold KV Tier + WiFi Cold SSM Tier

### Hot Tier: KVPrefixCache on GPU
Built on the EXO framework and resident in GPU memory. Eviction is triggered when memory usage reaches the 60% threshold, and the evicted KV data is handed to the cold tier via hook functions, roughly as sketched below.
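
The post does not show the hook API, so the following is only a minimal sketch of what such a hook could look like, assuming a callback-style interface; `ColdTierClient`, `promote`, `on_evict`, and `register_eviction_hook` are hypothetical names, not the actual TierKV/EXO API.

```python
# Minimal sketch of an eviction hook, assuming a callback-style API.
# ColdTierClient and on_evict are hypothetical names; the post only says
# evicted KV data is handed to the cold tier via hook functions.

EVICTION_THRESHOLD = 0.60  # evict once GPU memory usage crosses 60%

class ColdTierClient:
    def promote(self, prefix_id: str, kv_block) -> None:
        ...  # quantize and ship the block to the LAN cold tier

def make_eviction_hook(cold_tier: ColdTierClient):
    def on_evict(prefix_id: str, kv_block) -> None:
        # Instead of discarding the evicted block, push it across the network.
        cold_tier.promote(prefix_id, kv_block)
    return on_evict

# Registration is framework-specific; with EXO's KVPrefixCache it might
# look roughly like:
#   cache.register_eviction_hook(make_eviction_hook(client))
```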

### Cold KV Tier: Cross-Node Storage for Full Attention Layers
Stores complete attention KV states and transmits them to dedicated nodes in the LAN (e.g., Mac Pro) using the gRPC protocol. Before transmission, TurboQuant INT8 quantization is applied (3.9x compression ratio, SNR ≥52dB). The 10 full attention layers of Qwen3.6-35B-A3B are sent to this tier.
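
The gRPC message schema is not included in the post. As a rough illustration, the sketch below shows one way a quantized KV block, plus the metadata needed to dequantize it (layer index, group size, per-group scales), might be framed before it goes over the wire; the header layout and field names are assumptions.

```python
import struct
import numpy as np

# Hypothetical wire framing for one quantized KV block: a small header
# (layer index, group size, element count) followed by per-group FP32
# scales and the INT8 payload. Not the real TierKV schema.

HEADER = struct.Struct("<III")  # layer_id, group_size, n_elements

def pack_kv_block(layer_id: int, q: np.ndarray, scales: np.ndarray,
                  group_size: int = 256) -> bytes:
    header = HEADER.pack(layer_id, group_size, int(q.size))
    return header + scales.astype("<f4").tobytes() + q.astype(np.int8).tobytes()

def unpack_kv_block(buf: bytes):
    layer_id, group_size, n = HEADER.unpack_from(buf)
    n_groups = (n + group_size - 1) // group_size
    off = HEADER.size
    scales = np.frombuffer(buf, dtype="<f4", count=n_groups, offset=off)
    q = np.frombuffer(buf, dtype=np.int8, count=n, offset=off + 4 * n_groups)
    return layer_id, q, scales

# Usage: payload = pack_kv_block(3, q, scales); this byte string would be
# the body of a gRPC promote request in the real system.
```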

### Cold SSM Tier: Separate Storage for Linear Attention Layers
Separately stores the 30 SSM states of hybrid architecture models (e.g., Qwen3.6) on WiFi-connected nodes (e.g., Mac Air). Parallel transmission reduces network bottlenecks.

## Performance Test Results: Long-Context Inference TTFT Reduced from 30s to 4s, 7x Speedup

Test Configuration: DGX Spark inference node (128GB memory), Mac Pro M2 cold KV tier (32GB memory, 10GbE LAN), Mac Air M2 cold SSM tier (16GB memory, WiFi).

Results:

| Prompt length | Cold-start TTFT | Cold-tier recovery TTFT | Speedup |
| --- | --- | --- | --- |
| 8000 tokens | 30.83 s | 4.11 s | 7.3x |
| 3707 tokens | 23.78 s | 4.59 s | 5.2x |

Applicable scenarios: Customer service bots handling long conversation histories, code assistants analyzing large projects, etc.

## Technical Implementation Details: Quantization, Batch Transfer, and Automatic Layer Detection

### TurboQuant Quantization Algorithm
Optimized for KV tensors: group quantization in which each group of 256 floats shares a single scaling factor `max(|x|)/127`, converting BF16 to INT8 at a 3.9x compression ratio with SNR ≥ 52 dB.
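
As a concrete illustration of the formula above, here is a minimal NumPy sketch of symmetric INT8 group quantization with one scale per 256 values, plus an SNR measurement of the round trip. It follows the description in the post, not the actual TurboQuant code, and the SNR you observe depends on the data distribution.

```python
import numpy as np

# Group quantization sketch: each group of 256 floats shares the scale
# max(|x|)/127, as described in the post. Illustrative only.

GROUP = 256

def quantize(x: np.ndarray):
    flat = x.astype(np.float32).ravel()
    pad = (-flat.size) % GROUP
    groups = np.pad(flat, (0, pad)).reshape(-1, GROUP)
    scales = np.abs(groups).max(axis=1) / 127.0
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    q = np.clip(np.rint(groups / scales[:, None]), -127, 127).astype(np.int8)
    return q, scales, flat.size

def dequantize(q, scales, n):
    return (q.astype(np.float32) * scales[:, None]).ravel()[:n]

def snr_db(x, x_hat):
    # Signal-to-noise ratio of the quantization round trip, in dB.
    noise = x - x_hat
    return 10 * np.log10(np.sum(x**2) / np.sum(noise**2))

x = np.random.randn(8000 * 128).astype(np.float32)
q, s, n = quantize(x)
print(f"SNR: {snr_db(x, dequantize(q, s, n)):.1f} dB")
```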

### Batch Transfer Optimization
Replaces 40 sequential RPCs with two concurrent BatchPromote calls. On a cache miss, cold-tier data for both KV and SSM is pulled simultaneously, reducing network overhead.
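
A minimal sketch of the idea, assuming one vault node per tier: instead of one RPC per layer (10 KV + 30 SSM = 40 round trips), each tier's layers travel in one batched call and the two calls run concurrently. `fetch_batch`, the node addresses, and the layer split are placeholders standing in for the real gRPC BatchPromote call.

```python
import asyncio

async def fetch_batch(node: str, layer_ids: list[int]) -> dict[int, bytes]:
    # Placeholder for a gRPC BatchPromote round trip to `node`.
    await asyncio.sleep(0.05)                      # simulated network latency
    return {i: b"..." for i in layer_ids}

async def restore_prefix(kv_layers: list[int], ssm_layers: list[int]):
    # Two concurrent batched calls instead of 40 sequential ones.
    kv, ssm = await asyncio.gather(
        fetch_batch("mac-pro.local:50051", kv_layers),    # LAN cold KV tier
        fetch_batch("mac-air.local:50051", ssm_layers),   # WiFi cold SSM tier
    )
    return kv | ssm

layers = asyncio.run(restore_prefix(list(range(10)), list(range(10, 40))))
print(f"restored {len(layers)} layers")
```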

### Automatic Layer Type Detection
Automatically identifies full attention/linear attention layers using `isinstance`, eliminating the need for manual index configuration and improving generality.
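
A toy sketch of this detection, with placeholder classes (`FullAttention`, `GatedSSM`) standing in for the model's real layer types; the 10/30 split in the example mirrors the Qwen3.6-35B-A3B layout described earlier, and the every-4th-layer pattern is purely illustrative.

```python
# Toy isinstance-based layer classification. FullAttention and GatedSSM
# are placeholders; the real model's layer classes would be used instead.

class FullAttention: ...
class GatedSSM: ...

def classify_layers(layers):
    kv_idx, ssm_idx = [], []
    for i, layer in enumerate(layers):
        if isinstance(layer, FullAttention):
            kv_idx.append(i)       # KV state -> cold KV tier
        elif isinstance(layer, GatedSSM):
            ssm_idx.append(i)      # SSM state -> cold SSM tier
    return kv_idx, ssm_idx

# Illustrative 40-layer stack with a full-attention layer every 4th slot:
# yields 10 KV layers and 30 SSM layers, matching the article's split.
model = [FullAttention() if i % 4 == 3 else GatedSSM() for i in range(40)]
kv, ssm = classify_layers(model)
print(len(kv), len(ssm))  # 10 30
```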

## Deployment and Usage: Multi-Node Configuration and Integration Steps

Deployment requires at least two machines (inference + cold storage); a three-machine configuration can separate KV and SSM tiers. Steps:
1. Clone the repository and build the Rust extension: `cd tierkv-core && maturin develop --release`
2. Install the Python package: `pip install -e .`
3. Edit `tierkv.toml` to set node IPs and roles
4. Start the service on cold tier nodes: `tierkv vault --port 50051`
5. Integrate EXO on inference nodes: Run `tierkv install` and add hook code

Parameters such as the memory threshold and the quantization dimension can be adjusted in the configuration file; a sketch of what it might contain follows.
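
The post does not reproduce the `tierkv.toml` schema, so the keys below are guesses consistent with the parameters it mentions (node addresses, roles, memory threshold, quantization group size). The snippet loads and sanity-checks them with Python's standard `tomllib`.

```python
import tomllib

# Hypothetical tierkv.toml contents; the real schema is not shown in the
# post beyond "node IPs, roles, memory threshold, quantization dimension".
EXAMPLE = """
[inference]
role = "hot"
eviction_threshold = 0.60     # evict at 60% GPU memory usage

[[vaults]]
role = "cold_kv"
addr = "192.168.1.10:50051"   # Mac Pro on 10GbE LAN

[[vaults]]
role = "cold_ssm"
addr = "192.168.1.11:50051"   # Mac Air on WiFi

[quant]
group_size = 256              # floats per shared scale factor
"""

cfg = tomllib.loads(EXAMPLE)
assert 0.0 < cfg["inference"]["eviction_threshold"] < 1.0
print([v["role"] for v in cfg["vaults"]])  # ['cold_kv', 'cold_ssm']
```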

## Future Development Directions: Persistence, Adaptive Quantization, and Other Optimizations

Future improvements for TierKV:
- Persistent cold storage: Support SQLite/memory-mapped files to retain data after restarts
- Adaptive quantization: Train TurboQuant codebooks using real KV data to improve SNR
- LRU eviction strategy: add cold-tier capacity limits and LRU eviction (see the sketch at the end of this section)
- WiFi performance optimization: Support LAN connections or multi-path transmission

Currently, cold tier data is only stored in memory, and the WiFi connection for the SSM tier is a bottleneck.
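
As a minimal sketch of the proposed LRU direction (a possible design, not shipped TierKV code), a byte-capped cold-tier vault could look like this:

```python
from collections import OrderedDict

# Byte-capped LRU store: drop the least recently used entry once the
# configured capacity is exceeded. Illustrates the proposed policy only.

class LRUVault:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries: OrderedDict[str, bytes] = OrderedDict()

    def put(self, key: str, blob: bytes) -> None:
        if key in self.entries:
            self.used -= len(self.entries.pop(key))
        self.entries[key] = blob
        self.used += len(blob)
        while self.used > self.capacity:             # evict oldest first
            _, old = self.entries.popitem(last=False)
            self.used -= len(old)

    def get(self, key: str) -> bytes | None:
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)                # mark as recently used
        return self.entries[key]
```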

## Summary: Value of TierKV and Open Source Information

TierKV uses the idle memory of LAN devices to turn discarded KV caches into reusable assets. In tests, 6 of 227 evictions were successfully recovered, saving about 26 seconds each; at large-scale deployment these savings add up. It suits teams that want to extend context capability without buying more GPUs.

The project is open source; code and documentation are available on GitHub.
