Section 01
IceCache: Introduction to an Efficient KV Cache Management Scheme for Long-Sequence LLMs
In long-sequence inference tasks, KV cache management is a key bottleneck for the performance and resource efficiency of Large Language Models (LLMs). IceCache combines semantic token clustering with a paged attention mechanism, retaining 99% of the original model's accuracy while using only 25% of the cache budget, which makes it a memory-efficient solution for long-sequence inference.
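To make the idea of semantic token clustering concrete, here is a minimal sketch of one plausible compression rule: cluster the cached key vectors with k-means and keep one representative entry per cluster, sizing the number of clusters to the 25% budget. The function name `cluster_compress_kv`, the use of plain k-means, and the mean-based merging of values are all illustrative assumptions, not the actual IceCache algorithm.

```python
import numpy as np

def cluster_compress_kv(keys, values, budget_ratio=0.25, iters=10):
    """Illustrative sketch (not the IceCache implementation): compress a
    KV cache by clustering semantically similar token keys and keeping
    one representative (centroid) entry per cluster."""
    n, _ = keys.shape
    k = max(1, int(n * budget_ratio))  # retained entries under the budget
    rng = np.random.default_rng(0)
    centroids = keys[rng.choice(n, k, replace=False)].copy()
    for _ in range(iters):  # plain k-means on the key vectors
        # assign each cached token to its nearest centroid
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = assign == c
            if members.any():
                centroids[c] = keys[members].mean(axis=0)
    # merge values with one simple rule: mean of each cluster's values
    merged_values = np.stack([
        values[assign == c].mean(axis=0) if (assign == c).any() else values[0]
        for c in range(k)
    ])
    return centroids, merged_values

keys = np.random.default_rng(1).normal(size=(64, 8))
values = np.random.default_rng(2).normal(size=(64, 8))
ck, cv = cluster_compress_kv(keys, values)
print(ck.shape, cv.shape)  # 16 of 64 entries retained: the 25% budget
```

In a real system this compression would be applied per attention head and combined with paged storage of the retained entries; the sketch only shows how a fixed cache budget can translate into a cluster count.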