Section 01
TriAttention Core Guide: Trigonometric KV-Cache Compression for Memory-Efficient Long-Context Reasoning
This article introduces TriAttention, a technique that addresses the KV-cache memory blow-up in long-context inference for large language models. Exploiting the concentration property of Q/K vectors in pre-RoPE space, it estimates key-value importance with a trigonometric series, achieving 10.7x KV-cache compression at 32K tokens while matching full-attention accuracy, along with a 2.5x throughput improvement. A GGUF implementation is also provided, enabling deployment on consumer GPUs.
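To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of how pre-RoPE Q/K vectors admit a trigonometric expansion of attention scores. RoPE rotates each (even, odd) dimension pair by a position-dependent angle, so the post-RoPE dot product decomposes exactly into cosine and sine terms of the relative position, with coefficients computed from pre-RoPE vectors. Under the concentration assumption, a representative mean query `q_bar` can score cached keys without applying RoPE, and a truncated series (keeping only the highest-energy frequency bands) approximates importance cheaply. All function names, the `top_bands` truncation, and the mean-query heuristic are illustrative assumptions.

```python
import numpy as np

def rope_score_estimate(q_bar, K, m, positions, theta_base=10000.0, top_bands=8):
    """Estimate attention importance of cached keys via the trigonometric
    expansion of RoPE dot products, computed entirely in pre-RoPE space.

    q_bar:     (d,)   mean pre-RoPE query direction (concentration assumption)
    K:         (n, d) pre-RoPE cached keys
    m:         int    current query position
    positions: (n,)   positions of the cached keys
    top_bands: keep only the strongest frequency bands (truncated series)
    """
    d = q_bar.shape[0]
    half = d // 2
    theta = theta_base ** (-np.arange(half) * 2.0 / d)   # standard RoPE frequencies
    rel = (m - positions)[:, None] * theta[None, :]      # (n, half) relative angles
    qe, qo = q_bar[0::2], q_bar[1::2]                    # even/odd dim pairs
    Ke, Ko = K[:, 0::2], K[:, 1::2]
    # Exact identity: <RoPE_m(q), RoPE_n(k)> = sum_i a_i*cos(rel_i) + b_i*sin(rel_i)
    a = qe * Ke + qo * Ko                                # cosine coefficients
    b = qe * Ko - qo * Ke                                # sine coefficients
    # Truncate the series: keep the frequency bands carrying the most energy
    energy = (a ** 2 + b ** 2).mean(axis=0)
    keep = np.argsort(energy)[-top_bands:]
    return (a[:, keep] * np.cos(rel[:, keep])
            + b[:, keep] * np.sin(rel[:, keep])).sum(axis=1)

def compress_kv(K, V, positions, q_bar, m, keep_ratio=0.1):
    """Retain only the fraction of the KV cache estimated to matter most."""
    scores = rope_score_estimate(q_bar, K, m, positions)
    n_keep = max(1, int(len(K) * keep_ratio))
    idx = np.sort(np.argsort(scores)[-n_keep:])          # preserve sequence order
    return K[idx], V[idx], positions[idx]
```

With `keep_ratio=0.1` the cache shrinks roughly tenfold, in the spirit of the 10.7x figure above; the actual method's series construction and eviction policy may differ from this sketch.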