Section 01
DepthKV: Hierarchical KV Cache Pruning Technology for Long-Context LLM Inference (Introduction)
DepthKV proposes a hierarchical KV cache pruning strategy. By identifying that different Transformer layers have distinct KV cache requirements, it significantly reduces memory overhead in long-context LLM inference while maintaining model performance. Exploiting these layer-wise dependency differences, the strategy aggressively compresses the cache of insensitive layers while retaining high precision in critical layers, offering an effective memory optimization solution for long-context inference.
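To make the idea concrete, the sketch below illustrates one possible form of layer-wise KV cache pruning: each layer receives its own retention budget, and the least-important cached positions are dropped per layer. This is not DepthKV's actual implementation; the budget schedule, the attention-score importance heuristic, and helper names such as `allocate_layer_budgets` and `prune_kv_cache` are assumptions for illustration only.

```python
# Minimal sketch (assumed, not DepthKV's actual code): per-layer KV cache
# pruning where each layer keeps a different fraction of its cached tokens.
import torch


def allocate_layer_budgets(num_layers: int, seq_len: int,
                           min_ratio: float = 0.1, max_ratio: float = 1.0):
    """Assign each layer a KV retention budget (number of tokens to keep).

    Hypothetical policy: later layers are treated as less cache-sensitive
    and receive smaller budgets; critical layers keep the full cache.
    """
    ratios = torch.linspace(max_ratio, min_ratio, num_layers)
    return [max(1, int(seq_len * r)) for r in ratios]


def prune_kv_cache(keys: torch.Tensor, values: torch.Tensor,
                   attn_scores: torch.Tensor, budget: int):
    """Keep only the `budget` most-attended positions in one layer's cache.

    keys/values: (batch, heads, seq_len, head_dim)
    attn_scores: (batch, heads, seq_len) cumulative attention received by
                 each cached position (an assumed importance signal).
    """
    seq_len = keys.shape[2]
    if budget >= seq_len:
        return keys, values
    # Rank positions by importance averaged over heads, keep the top-k.
    importance = attn_scores.mean(dim=1)             # (batch, seq_len)
    topk = importance.topk(budget, dim=-1).indices   # (batch, budget)
    idx = topk[:, None, :, None].expand(-1, keys.shape[1], -1, keys.shape[3])
    return keys.gather(2, idx), values.gather(2, idx)


# Usage sketch: apply per-layer budgets to a list of cached (K, V) tensors.
if __name__ == "__main__":
    batch, heads, seq_len, head_dim, num_layers = 1, 8, 4096, 64, 32
    budgets = allocate_layer_budgets(num_layers, seq_len)
    kv_cache = [(torch.randn(batch, heads, seq_len, head_dim),
                 torch.randn(batch, heads, seq_len, head_dim))
                for _ in range(num_layers)]
    scores = torch.rand(batch, heads, seq_len)
    pruned = [prune_kv_cache(k, v, scores, b)
              for (k, v), b in zip(kv_cache, budgets)]
    print([p[0].shape[2] for p in pruned])  # retained tokens per layer
```

Under these assumptions, memory savings come from the uneven budget schedule: insensitive layers retain only a small fraction of cached tokens, while critical layers are left untouched.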