Section 01
[Introduction] CachePrune: A KV Cache Sharing Mechanism for LLM Inference That Balances Privacy and Efficiency
This article introduces CachePrune, a privacy-aware, fine-grained KV cache sharing mechanism for LLM inference. Sharing KV cache entries across users improves performance but exposes a side channel through which information can leak between users; CachePrune is designed to address that risk while keeping the performance benefit. Its core idea is token-level cache management, which identifies the privacy-irrelevant segments that are safe to reuse. With privacy protection in place, it reduces Time To First Token (TTFT) by 4.5x and raises the cache hit rate by 44%. Built on the vLLM framework, it applies to multi-tenant services, agent workflows, and Retrieval-Augmented Generation (RAG), offering a practical way to balance privacy and efficiency in LLM services.
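To make the idea of token-level cache management concrete, here is a minimal sketch of how a prompt might be split into segments that can be shared across users and segments that stay private to one user. It is an illustration only, not CachePrune's implementation: this section does not say how privacy-relevant tokens are detected, so the sketch takes a per-token boolean mask as given, and the names `segment_scopes`, `SHARED`, and the `user:<id>` scope label are hypothetical.

```python
from typing import List, Tuple

# Hypothetical scope label for cache entries any user may reuse.
SHARED = "shared"


def segment_scopes(sensitive: List[bool], user_id: str) -> List[Tuple[int, int, str]]:
    """Group token positions into contiguous runs and assign each a cache scope.

    `sensitive[i]` is True when token i is privacy-relevant. How that mask is
    produced is an assumption of this sketch and is not described in the text.
    Returns (start, end_exclusive, scope) tuples covering every token once.
    """
    runs: List[Tuple[int, int, str]] = []
    start = 0
    for i in range(1, len(sensitive) + 1):
        # Close the current run at the end of input or when the flag flips.
        if i == len(sensitive) or sensitive[i] != sensitive[start]:
            scope = f"user:{user_id}" if sensitive[start] else SHARED
            runs.append((start, i, scope))
            start = i
    return runs


if __name__ == "__main__":
    # Tokens 0-1: shared system prompt, 2-3: user-specific data, 4: shared suffix.
    mask = [False, False, True, True, False]
    for start, end, scope in segment_scopes(mask, "alice"):
        print(start, end, scope)
```

In this sketch only the runs labelled `shared` would be eligible for cross-user reuse, while runs labelled with a user scope would be cached per user. A real system also has to account for the fact that a token's KV entries depend on the tokens that precede it, which this simplification ignores.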