Section 01
DepthKV: A Layer-Dependent KV Cache Pruning Framework for Memory Optimization in Long-Context Reasoning
DepthKV is a layer-dependent KV cache pruning framework that targets the memory bottleneck in long-context LLM reasoning. Instead of pruning every layer uniformly, it allocates a global cache budget across Transformer layers according to each layer's sensitivity to pruning. At the same compression ratio, it consistently outperforms traditional uniform pruning methods, offering a new approach to memory optimization.
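To make the budget-allocation idea concrete, below is a minimal Python sketch. The function names (`allocate_layer_budgets`, `prune_kv_cache`), the sensitivity-proportional split with a per-layer floor, and the attention-score eviction heuristic are all illustrative assumptions; the summary above does not specify DepthKV's actual sensitivity metric or allocation rule.

```python
import torch

def allocate_layer_budgets(sensitivity, total_budget, min_tokens=16):
    """Split a global KV cache token budget across layers.

    Hypothetical rule: each layer gets a share proportional to its
    pruning-sensitivity score, plus a floor of `min_tokens` so no
    layer loses its entire cache. (Assumed, not DepthKV's actual rule.)
    """
    sensitivity = torch.as_tensor(sensitivity, dtype=torch.float32)
    floor = min_tokens * len(sensitivity)
    assert total_budget >= floor, "budget too small for the per-layer floor"
    shares = sensitivity / sensitivity.sum()
    # Rounding may make the sum deviate from total_budget by a few tokens.
    budgets = (min_tokens + shares * (total_budget - floor)).round().long()
    return budgets

def prune_kv_cache(keys, values, attn_scores, budget):
    """Keep the `budget` tokens with the highest accumulated attention.

    keys/values: (num_tokens, head_dim); attn_scores: (num_tokens,).
    Attention-based importance is a common eviction heuristic, used
    here only as a stand-in for whatever scoring DepthKV employs.
    """
    budget = min(budget, keys.shape[0])
    keep = attn_scores.topk(budget).indices.sort().values  # keep token order
    return keys[keep], values[keep]

if __name__ == "__main__":
    seq_len, head_dim = 128, 8
    # Toy sensitivity scores: higher = layer degrades more when pruned.
    sens = [0.9, 0.4, 0.2, 0.1]
    budgets = allocate_layer_budgets(sens, total_budget=256)
    print("per-layer budgets:", budgets.tolist(), "sum:", int(budgets.sum()))

    for layer, b in enumerate(budgets.tolist()):
        k = torch.randn(seq_len, head_dim)
        v = torch.randn(seq_len, head_dim)
        scores = torch.rand(seq_len)  # stand-in for accumulated attention
        k_kept, v_kept = prune_kv_cache(k, v, scores, b)
        print(f"layer {layer}: kept {k_kept.shape[0]}/{seq_len} tokens")
```

The per-layer floor reflects the intuition behind layer-dependent allocation: even layers that tolerate aggressive pruning still need some retained context, while sensitive layers receive a larger slice of the same global budget than uniform pruning would give them.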