Section 01
[Overview] New Approach to KV Cache Compression: A Minimal-Intervention Diversity Penalty Strategy
This article addresses the memory bottleneck that the KV cache creates in large language model inference. We first systematically evaluated seven existing compression mechanisms, none of which passed strict validation. We then propose Alpha, a minimal-intervention method that adds a diversity penalty, derived from the facility location problem, to KV selection and requires modifying only a single function. Alpha was validated in pre-registered experiments and proved effective under specific model and budget conditions. The result suggests that a simple improvement can outperform complex structural redesigns.
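To make the idea of a diversity penalty in KV selection concrete, the following is a minimal sketch and not the article's actual implementation. It assumes that each cached token has a key vector and a scalar importance score (for example, accumulated attention), and that the penalty for a candidate is its maximum cosine similarity to the tokens already kept, which is the facility-location notion of how well the kept set already "covers" that candidate. The function name `select_kv`, the weight `lam`, and the exact scoring rule are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_kv(keys, importance, budget, lam=0.5):
    """Greedily pick `budget` token indices to keep in the cache.

    Hypothetical scoring rule (an assumption, not the article's formula):
        score(i) = importance[i] - lam * max_{j in kept} cos(keys[i], keys[j])
    With lam = 0 this reduces to plain top-k selection by importance.
    """
    n = len(keys)
    budget = min(budget, n)
    kept = []
    # coverage[i] = highest similarity between candidate i and any kept token,
    # clipped below at 0 so dissimilar tokens are never rewarded.
    coverage = [0.0] * n
    remaining = set(range(n))
    while len(kept) < budget:
        # Iterate in index order so ties resolve to the earliest token.
        best = max(sorted(remaining),
                   key=lambda i: importance[i] - lam * coverage[i])
        kept.append(best)
        remaining.remove(best)
        # Update how well the kept set now covers each remaining candidate.
        for i in remaining:
            coverage[i] = max(coverage[i], cosine(keys[i], keys[best]))
    return kept


# Tokens 0 and 1 have nearly identical keys; token 2 is orthogonal to both.
keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
importance = [1.0, 0.9, 0.5]

print(select_kv(keys, importance, budget=2, lam=0.0))  # [0, 1]
print(select_kv(keys, importance, budget=2, lam=1.0))  # [0, 2]
```

In this toy case, plain top-k keeps two near-duplicate tokens, while the penalized selection trades the redundant second token for a less important but distinct one. The change is confined to the selection function, which mirrors the single-function scope described above.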