Section 01
TierKV: A Cross-Node Distributed KV Cache System That Speeds Up LLM Long-Context Inference by 7x [Introduction]
TierKV is a cross-node distributed KV cache system. It targets the cold-start problem caused by KV cache eviction in LLM long-context inference: instead of discarding evicted KV caches, it retains them across the network in a three-tier architecture (GPU Hot Tier, LAN Cold KV Tier, WiFi Cold SSM Tier). This cuts the time to first token (TTFT) of long-context inference from 30 seconds to 4 seconds (roughly a 7x speedup) and offers a cost-effective path to extending LLM inference context length.
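To make the tiering concrete, here is a minimal sketch of how such a three-tier lookup might behave: check the GPU hot tier first, fall through to the slower cold tiers, and only recompute the prefill on a full miss (the cold-start case the numbers above describe). This is not TierKV's actual implementation; the class, the dict-backed tiers, and the `prefix_hash` keying are all placeholder assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TieredKVCache:
    """Illustrative three-tier lookup (hypothetical, not TierKV's API)."""
    gpu_hot: dict = field(default_factory=dict)    # KV blocks resident in GPU memory
    lan_cold: dict = field(default_factory=dict)   # stand-in for a LAN peer's cold KV store
    wifi_cold: dict = field(default_factory=dict)  # stand-in for the WiFi-reachable SSM store

    def get(self, prefix_hash: str) -> Optional[bytes]:
        # 1. Hot tier: cache already on the GPU, fastest path.
        if prefix_hash in self.gpu_hot:
            return self.gpu_hot[prefix_hash]
        # 2. LAN cold tier: fetch a previously evicted KV cache from a
        #    LAN peer and promote it back into the hot tier.
        if prefix_hash in self.lan_cold:
            kv = self.lan_cold[prefix_hash]
            self.gpu_hot[prefix_hash] = kv
            return kv
        # 3. WiFi cold tier: slowest network path, still far cheaper
        #    than recomputing the whole prefill.
        if prefix_hash in self.wifi_cold:
            kv = self.wifi_cold[prefix_hash]
            self.gpu_hot[prefix_hash] = kv
            return kv
        # Full miss: the caller must recompute the prefill from scratch.
        return None

    def evict(self, prefix_hash: str) -> None:
        # Instead of discarding on eviction, demote hot KV blocks to the LAN tier.
        kv = self.gpu_hot.pop(prefix_hash, None)
        if kv is not None:
            self.lan_cold[prefix_hash] = kv
```

Under this reading, a `None` from `get` is the 30-second path (full prefill recomputation), while any tier hit is a network fetch plus promotion back into GPU memory, which is where the TTFT reduction would come from.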