Section 01
[Introduction] IceCache: A New Efficient KV Cache Management Scheme for Long-Sequence Large Language Models
IceCache is a scheme proposed to address the KV cache memory bottleneck in long-sequence Large Language Model (LLM) inference. It combines semantic clustering with a paged attention mechanism, and it retains near-original inference accuracy while using only 25% of the KV cache budget. This makes it a practical memory optimization for long-sequence LLM inference.
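
To give a rough sense of scale for the memory bottleneck and the 25% budget, here is a minimal back-of-the-envelope sketch. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16, 32K tokens) are hypothetical values chosen for illustration and are not taken from IceCache. The helper names `kv_cache_bytes` and `budgeted_tokens` are also made up for this example. The sketch only does the arithmetic; it does not model how IceCache decides which entries to keep.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Size of a full KV cache: one K and one V tensor per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


def budgeted_tokens(seq_len, budget_ratio):
    """Number of token entries that fit in a given fraction of the full cache."""
    return int(seq_len * budget_ratio)


# Hypothetical configuration (assumption, not from the source).
SEQ_LEN = 32768
full_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                            head_dim=128, seq_len=SEQ_LEN)
kept_tokens = budgeted_tokens(SEQ_LEN, 0.25)
budget_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                              head_dim=128, seq_len=kept_tokens)

print(f"full cache:  {full_bytes / 2**30:.1f} GiB for {SEQ_LEN} tokens")
print(f"25% budget:  {budget_bytes / 2**30:.1f} GiB for {kept_tokens} tokens")
```

Under these assumed dimensions, the full cache for a single 32K-token sequence comes to 16 GiB, and a 25% budget keeps 8192 token entries in 4 GiB. The cache grows linearly with sequence length, which is why long sequences are where the budget matters most.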