Section 01
[Introduction] LaProx: Redefining KV Cache Eviction Strategy for Long-Context LLM Inference
LaProx introduces an output-aware KV cache eviction framework. By explicitly modeling the multiplicative interaction between attention maps and projected value states, i.e., each cached token's actual contribution to the attention output, it produces a single, globally comparable token-importance score. Guided by this score, the model preserves its performance even when only 5% of the KV cache is retained, offering an efficient remedy for the memory bottleneck of long-context LLM inference.
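To make the idea concrete, the following is a minimal sketch of what an output-aware importance score could look like, assuming the score for each cached token is the product of the attention mass it receives and the magnitude of its value state after the output projection, aggregated across heads. The function names, tensor shapes, and the exact combination rule are illustrative assumptions, not LaProx's published formulation.

```python
import torch


def output_aware_scores(attn_weights: torch.Tensor,
                        value_states: torch.Tensor,
                        w_o: torch.Tensor) -> torch.Tensor:
    """Score each cached token by its contribution to the attention output.

    attn_weights: (num_heads, q_len, kv_len) attention probabilities
    value_states: (num_heads, kv_len, head_dim) cached value vectors
    w_o:          (num_heads * head_dim, hidden_dim) output projection
    Returns:      (kv_len,) importance score per cached token
    """
    num_heads, kv_len, head_dim = value_states.shape
    # Norm of each value vector after the per-head slice of the output projection.
    w_o_heads = w_o.view(num_heads, head_dim, -1)              # (H, d, hidden)
    projected = torch.einsum("hkd,hdo->hko", value_states, w_o_heads)
    value_norms = projected.norm(dim=-1)                        # (H, kv_len)
    # Attention mass each cached token receives, summed over query positions.
    attn_mass = attn_weights.sum(dim=1)                         # (H, kv_len)
    # Multiplicative interaction: attention weight times projected value magnitude,
    # summed over heads to give one globally comparable score per cached token.
    return (attn_mass * value_norms).sum(dim=0)                 # (kv_len,)


def evict_kv_cache(scores: torch.Tensor, keep_ratio: float = 0.05) -> torch.Tensor:
    """Return the (sorted) indices of tokens to keep under the cache budget."""
    keep = max(1, int(scores.numel() * keep_ratio))
    return torch.topk(scores, keep).indices.sort().values


# Toy usage on random tensors (2 heads, 8 queries, 128 cached tokens).
H, Q, K, D, HID = 2, 8, 128, 16, 32
attn = torch.softmax(torch.randn(H, Q, K), dim=-1)
vals = torch.randn(H, K, D)
w_o = torch.randn(H * D, HID)
keep_idx = evict_kv_cache(output_aware_scores(attn, vals, w_o), keep_ratio=0.05)
```

The point of the sketch is the scoring signal: unlike purely attention-based heuristics, a token that is attended to but carries a near-zero projected value contributes little to the output and can be evicted under the same unified score.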