Section 01
[Introduction] HybridKV: A Groundbreaking Framework for KV Cache Compression in Multimodal Large Models
HybridKV is an efficient compression framework that addresses the KV cache memory bottleneck in multimodal large language model (MLLM) inference. Its core innovation is exploiting the heterogeneous behavior of attention heads: a three-stage hybrid compression strategy (head-type classification → hierarchical budget allocation → differentiated compression execution) achieves up to 7.9x cache compression and a 1.52x decoding speedup while leaving model performance nearly unchanged.
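To make the three-stage idea concrete, here is a minimal sketch of such a pipeline. Everything in it is an assumption for illustration: the function names, the visual-attention-mass classification rule, the 2:1 budget split, and the top-k eviction are hypothetical stand-ins, not HybridKV's actual criteria.

```python
import numpy as np

def classify_heads(attn, visual_mask, thresh=0.5):
    """Stage 1 (hypothetical rule): label each head by where its
    attention mass falls.

    attn: (heads, seq) mean attention each head assigns to each token.
    visual_mask: (seq,) bool, True for visual (image) tokens.
    """
    visual_mass = attn[:, visual_mask].sum(axis=1)  # mass on image tokens
    return ["visual" if m > thresh else "textual" for m in visual_mass]

def allocate_budgets(labels, total_budget):
    """Stage 2 (hypothetical 2:1 split): give heads that concentrate
    on visual tokens a larger share of the total cache budget."""
    weights = np.array([2.0 if l == "visual" else 1.0 for l in labels])
    budgets = np.floor(total_budget * weights / weights.sum()).astype(int)
    return np.maximum(budgets, 1)  # every head keeps at least one entry

def compress_head(attn_row, budget):
    """Stage 3 (simple stand-in): keep the indices of the top-`budget`
    tokens by attention score, evicting the rest from the cache."""
    keep = np.argsort(attn_row)[-budget:]
    return np.sort(keep)

def hybrid_compress(attn, visual_mask, total_budget):
    """Run classification -> budget allocation -> per-head compression,
    returning the kept token indices for each head."""
    labels = classify_heads(attn, visual_mask)
    budgets = allocate_budgets(labels, total_budget)
    return [compress_head(attn[h], b) for h, b in enumerate(budgets)]
```

The point of the split is that a visually focused head and a text-focused head get different budgets and could, in a fuller implementation, get entirely different eviction rules in stage 3.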