Section 01
[Introduction] KVSculpt: Reformulating KV Cache Compression as Knowledge Distillation
The long-context reasoning capability of large language models underpins many applications, but the memory footprint of the KV cache has become a deployment bottleneck. Existing compression methods are typically anchored to the original KV entries: they select or merge a subset of the cached keys and values, which limits the fidelity achievable at a given budget. KVSculpt instead reformulates KV cache compression as a knowledge distillation problem. It abandons anchoring to the original entries, optimizes compressed KV pairs as free parameters in a continuous embedding space so as to preserve the model's attention behavior, and introduces an adaptive budget allocation mechanism. Experiments on the Qwen2.5-1.5B model show that KVSculpt reduces KL divergence by 3.5-4.1x, a substantial improvement in compression quality.
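The core idea above, treating the compressed cache as free parameters trained to match the full cache's attention outputs, can be illustrated with a minimal sketch. This is not the KVSculpt algorithm itself: the dimensions, the mean-squared-error objective, the probe queries, and the finite-difference gradient descent are all illustrative assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 8, 2, 4            # full cache length, compressed length, head dim

K = rng.normal(size=(n, d))  # original keys
V = rng.normal(size=(n, d))  # original values
Q = rng.normal(size=(3, d))  # probe queries used as distillation inputs

def attend(q, k, v):
    # Scaled dot-product attention with a stable softmax.
    s = q @ k.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

target = attend(Q, K, V)     # "teacher": attention outputs under the full cache

# "Student": m compressed KV pairs optimized as free parameters in a
# continuous space -- NOT anchored to any of the original n entries.
params = 0.1 * rng.normal(size=(2, m, d))

def loss(p):
    # Distillation objective: match the teacher's attention outputs.
    return np.mean((attend(Q, p[0], p[1]) - target) ** 2)

init_loss = loss(params)

# Plain gradient descent with finite-difference gradients (sketch only;
# a real implementation would use autodiff).
eps, lr = 1e-5, 0.1
for _ in range(300):
    base = loss(params)
    grad = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        p = params.copy()
        p[idx] += eps
        grad[idx] = (loss(p) - base) / eps
    params -= lr * grad

final_loss = loss(params)
```

The key design point the sketch captures is that the 2 compressed entries need not coincide with any of the 8 original ones; they move freely in the embedding space until the attention outputs for the probe queries approximate the teacher's.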