Section 01
CombLlama: A Hybrid KV Cache Compression Architecture to Break the Memory Bottleneck of Long-Context LLM Inference
CombLlama proposes a hybrid KV cache compression architecture that addresses the memory bottleneck of long-context LLM inference by introducing chunk encoders and a cross-attention mechanism. The architecture substantially reduces the memory footprint of the KV cache while preserving generation quality, offering a practical way to process ultra-long sequences such as entire books or multi-turn conversation histories.
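To make the idea concrete, the following is a minimal PyTorch sketch of the general pattern the abstract describes: a chunk encoder compresses each span of the long context into a few summary vectors, and the decoder reads those summaries back through cross-attention instead of attending over the full KV cache. This is not CombLlama's actual implementation; class names (ChunkEncoder, CompressedCrossAttention) and parameters such as num_summary and chunk_size are illustrative assumptions.

import torch
import torch.nn as nn

class ChunkEncoder(nn.Module):
    """Compress a chunk of hidden states into a few learned summary vectors."""
    def __init__(self, d_model: int, num_summary: int = 4, n_heads: int = 8):
        super().__init__()
        # Learned query vectors that pool the chunk into num_summary slots.
        self.queries = nn.Parameter(torch.randn(num_summary, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        # chunk: (batch, chunk_len, d_model) -> (batch, num_summary, d_model)
        q = self.queries.unsqueeze(0).expand(chunk.size(0), -1, -1)
        summary, _ = self.attn(q, chunk, chunk)
        return summary

class CompressedCrossAttention(nn.Module):
    """Decoder-side cross-attention that reads only the compressed summaries."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x: current decoder states; memory: concatenated chunk summaries.
        out, _ = self.attn(x, memory, memory)
        return x + out  # residual connection

# Toy usage: a 4096-token context reduced to 128 summary vectors (32x smaller)
# before the decoder cross-attends to it.
d_model, chunk_size = 512, 128
hidden = torch.randn(1, 4096, d_model)                    # long-context hidden states
encoder = ChunkEncoder(d_model, num_summary=4)
chunks = hidden.split(chunk_size, dim=1)                  # 32 chunks of 128 tokens
memory = torch.cat([encoder(c) for c in chunks], dim=1)   # (1, 128, d_model)
reader = CompressedCrossAttention(d_model)
query_states = torch.randn(1, 16, d_model)                # states for new tokens
out = reader(query_states, memory)                        # (1, 16, d_model)

The key design point this sketch illustrates is that memory grows with the number of summary vectors rather than with the raw context length, which is where the KV cache savings come from; how CombLlama trains the encoder and integrates the cross-attention into the base model is specified in the paper itself.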