Zing Forum

CachePrune: A Privacy-Aware Fine-Grained KV Cache Sharing Mechanism for Efficient LLM Inference

This article introduces CachePrune, a privacy-aware, fine-grained KV cache sharing mechanism. It is reported to eliminate the side-channel leakage risk created by cross-user cache sharing while reducing time to first token (TTFT) by 4.5x and increasing the cache hit rate by 44%. It does this by managing the cache at the token level, which lets it identify the privacy-irrelevant segments that are safe to reuse.

Tags: KV cache, privacy, side-channel attacks, LLM inference, cache sharing, vLLM, TTFT optimization
Published 2026-05-22 21:54 | Recent activity 2026-05-25 11:23 | Estimated read 7 min

Section 01

[Introduction] CachePrune: A KV Cache Sharing Mechanism for LLM Inference That Balances Privacy and Efficiency

CachePrune is a privacy-aware, fine-grained KV cache sharing mechanism. It targets the side-channel leakage that cross-user KV cache sharing introduces in LLM inference, without giving up the performance that sharing provides. Its core idea is token-level cache management: sensitive tokens stay private, and only privacy-irrelevant segments are reused across users. With this protection in place, it reduces Time To First Token (TTFT) by 4.5x and increases the cache hit rate by 44%. CachePrune is implemented on top of the vLLM framework and applies to multi-tenant services, Agent workflows, and Retrieval-Augmented Generation (RAG), offering a practical way to balance privacy and efficiency in LLM services.

Section 02

The Double-Edged Sword of KV Cache Sharing: Dilemma Between Performance and Privacy Risks

The KV cache is critical for LLM inference. By storing the keys and values of tokens that have already been processed, it reduces the per-token cost of autoregressive decoding from quadratic to linear in the sequence length, which makes long-context inference practical. Sharing KV cache entries across users who send similar content boosts performance further, but it also opens a side channel: an attacker can send probe requests, detect which ones hit the cache (for example, through a lower TTFT), and infer what other users submitted. Existing defenses simply disable cross-user sharing. That is safe, but it gives up significant performance gains, especially in scenarios dominated by public content, such as Agent systems.
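
The cache-hit side channel can be illustrated with a toy simulation. The sketch below is an assumption-laden model, not CachePrune or vLLM code: `SharedPrefixCache` and `simulated_ttft` are hypothetical names, tokens are plain words, and TTFT is modeled as the number of tokens that still need to be prefilled.

```python
class SharedPrefixCache:
    """Toy stand-in for a cross-user KV cache keyed by token prefixes."""

    def __init__(self):
        self._prefixes = set()

    def insert(self, tokens):
        # Cache every prefix of the request, as a prefix cache would.
        for end in range(1, len(tokens) + 1):
            self._prefixes.add(tuple(tokens[:end]))

    def longest_hit(self, tokens):
        # Number of leading tokens whose KV entries are already cached.
        hit = 0
        for end in range(1, len(tokens) + 1):
            if tuple(tokens[:end]) not in self._prefixes:
                break
            hit = end
        return hit


def simulated_ttft(cache, tokens, cost_per_token=1.0):
    # Assumed cost model: only uncached tokens have to be prefilled.
    return (len(tokens) - cache.longest_hit(tokens)) * cost_per_token


cache = SharedPrefixCache()
cache.insert("my diagnosis is diabetes".split())  # victim's request

# The attacker times candidate inputs; the fastest one matches the victim.
guesses = ["my diagnosis is asthma", "my diagnosis is diabetes"]
latencies = {g: simulated_ttft(cache, g.split()) for g in guesses}
leaked = min(latencies, key=latencies.get)
print(latencies, "->", leaked)
```

In this model the correct guess is served entirely from cache, so its simulated latency is lower than that of the wrong guess, and the attacker learns the victim's final token without ever seeing the victim's request.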

Section 03

Core Innovations of CachePrune: Fine-Grained Privacy Awareness and Variable-Length Segment Management

CachePrune’s core insight is that privacy risk and cache reuse potential can be separated at the token level. Its key designs are:
1. Flexible sensitivity annotation: sensitive regions of a request are marked according to the needs of the scenario.
2. A variable-length segment index: a structure that efficiently retrieves cached segments for reuse requests of arbitrary length.
3. Strict privacy guarantees: KV representations of sensitive tokens are never shared across users, which cuts off the side-channel path. A formal privacy analysis is provided to prove security.
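
The token-level separation can be sketched as follows. This is a minimal illustration under assumptions, not the article's implementation: `Segment` and `split_by_sensitivity` are hypothetical names, the sensitivity mask is assumed to come from an external annotator, and the sketch only partitions tokens. It does not model how a token's KV values depend on the tokens before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int        # index of the first token of the run in the request
    tokens: tuple     # tokens in this run
    shareable: bool   # False when the run consists of sensitive tokens


def split_by_sensitivity(tokens, sensitive):
    """Group consecutive tokens with the same sensitivity flag into runs."""
    if len(tokens) != len(sensitive):
        raise ValueError("tokens and sensitive mask must have equal length")
    segments = []
    run_start = 0
    for i in range(1, len(tokens) + 1):
        # Close the current run at the end of input or when the flag flips.
        if i == len(tokens) or sensitive[i] != sensitive[run_start]:
            segments.append(
                Segment(run_start, tuple(tokens[run_start:i]),
                        not sensitive[run_start])
            )
            run_start = i
    return segments


tokens = ["You", "are", "helpful", ".", "Name", ":",
          "Alice", "Smith", "needs", "help"]
sensitive = [False] * 6 + [True, True] + [False, False]

segments = split_by_sensitivity(tokens, sensitive)
for seg in segments:
    label = "shared pool" if seg.shareable else "private"
    print(seg.start, seg.tokens, label)
```

Only the runs flagged as shareable would be candidates for a cross-user pool; the run containing the name stays with the request that produced it.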

Section 04

System Architecture of CachePrune and vLLM Integration

CachePrune is built on the vLLM framework. Its main components are:
1. Sensitivity-aware KV management: each request's KV cache is split into a private part (sensitive tokens) and a shared part (non-sensitive tokens) for dynamic offloading.
2. A variable-length segment index: a layered lookup in which a content hash locates candidates, a prefix tree handles variable-length matches, and an exact comparison verifies the result, balancing retrieval efficiency against accuracy.
3. Integration with vLLM’s PagedAttention mechanism: code coupling is kept to a minimum so that the system is easy to maintain and upgrade.
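
One way to picture the three lookup layers is the sketch below. It is a guess at the general shape, not the article's data structure and not a vLLM API: `SegmentIndex`, `ANCHOR_LEN`, and the string `kv_handle` values are all hypothetical, and segments shorter than the anchor length are simply rejected.

```python
import hashlib


class _Node:
    def __init__(self):
        self.children = {}
        self.kv_handle = None  # set on the node where a cached segment ends
        self.tokens = None     # full segment, kept for exact verification


class SegmentIndex:
    ANCHOR_LEN = 2  # leading tokens hashed to pick a prefix tree (assumed)

    def __init__(self):
        self._roots = {}  # anchor hash -> prefix-tree root

    @classmethod
    def _anchor(cls, tokens):
        head = "\x1f".join(tokens[:cls.ANCHOR_LEN])
        return hashlib.sha256(head.encode("utf-8")).hexdigest()

    def insert(self, tokens, kv_handle):
        if len(tokens) < self.ANCHOR_LEN:
            raise ValueError("segment is shorter than the anchor length")
        node = self._roots.setdefault(self._anchor(tokens), _Node())
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())
        node.kv_handle = kv_handle
        node.tokens = tuple(tokens)

    def longest_match(self, tokens):
        """Return (length, kv_handle) of the longest cached prefix of tokens."""
        if len(tokens) < self.ANCHOR_LEN:
            return 0, None
        node = self._roots.get(self._anchor(tokens))  # layer 1: content hash
        if node is None:
            return 0, None
        best = (0, None)
        for depth, tok in enumerate(tokens, start=1):  # layer 2: prefix tree
            node = node.children.get(tok)
            if node is None:
                break
            # Layer 3: exact comparison before trusting the candidate.
            if node.kv_handle is not None and node.tokens == tuple(tokens[:depth]):
                best = (depth, node.kv_handle)
        return best


index = SegmentIndex()
index.insert(["tool", ":", "search"], "kv-A")
index.insert(["tool", ":", "search", "(", "query", ")"], "kv-B")

match = index.longest_match(["tool", ":", "search", "(", "url", ")"])
miss = index.longest_match(["user", ":", "hi"])
print(match, miss)
```

Because this toy tree is keyed on the tokens themselves, the exact comparison is redundant here. It stands in for the verification stage, which would matter in a design that walks the tree on hashed or truncated keys.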

Section 05

Experimental Validation: Privacy Effectiveness and Performance Gains of CachePrune

The reported experimental results cover four areas:
1. Privacy protection: CachePrune fully resists side-channel attacks, even under the strongest threat model considered.
2. Performance gains: TTFT is reduced by 4.5x, the cache hit rate increases by 44%, and throughput grows significantly.
3. Comparison with existing schemes: security matches the no-sharing baseline, performance approaches the full-sharing baseline, and CachePrune outperforms sentence-level sharing.
4. Overhead: the delay added by sensitivity annotation is negligible, index maintenance overhead is offset by the gains, and the additional memory use is acceptable.

Section 06

Applicable Scenarios and Practical Value of CachePrune

CachePrune is well suited to three kinds of deployment:
1. Multi-tenant LLM services: tenants remain isolated while public content is still reused.
2. Agent workflow platforms: fixed tool descriptions and system prompts are reused as much as possible.
3. RAG systems: KV representations of overlapping knowledge base segments are reused safely.
In each of these scenarios, CachePrune balances privacy and efficiency to improve service quality.

Section 07

Limitations and Future Research Directions

CachePrune has three limitations:
1. The accuracy of sensitivity annotation depends on automated tools, so an incorrect annotation may cause a privacy leak.
2. Cache lifecycle management for dynamic content (e.g., real-time knowledge bases) needs further optimization.
3. Only the text modality is supported; multi-modal KV cache management would require an extension.
Future research will address these areas.

Section 08

Conclusion: A New Path for Balancing Privacy and Efficiency in LLM Services

CachePrune demonstrates the value of fine-grained security strategies. By identifying privacy boundaries accurately, it delivers both privacy protection and performance instead of trading one for the other. Its ideas apply to KV cache management and can also inform the security design of other LLM components. As LLM services become widespread, CachePrune gives providers a practical way to build privacy-safe, high-performance inference services.