Section 01
[Introduction] TriAttention: Compressing the KV Cache with Trigonometric Series to Run Long-Context Inference on Consumer GPUs
Long-context inference is reshaping what large language models can do, but the memory growth of the KV cache has become a deployment bottleneck. TriAttention builds on the observation that Q/K vectors concentrate in the pre-RoPE space, and it models distance preferences with a trigonometric series. It reports a 10.7x KV memory compression and a 2.5x throughput improvement while maintaining full-attention accuracy, which enables 32K-token inference to run on a single consumer GPU for the first time.
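To make the memory bottleneck concrete, here is a minimal back-of-the-envelope sketch of how KV cache size scales with sequence length. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 storage) is a hypothetical configuration chosen for illustration and is not taken from the TriAttention work. The 10.7x figure is the reported compression ratio, applied here as a simple division rather than as a description of how the method works.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to store keys and values for every layer and token."""
    # The leading factor 2 accounts for storing both K and V.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical model shape (an assumption for illustration only).
full_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                            head_dim=128, seq_len=32 * 1024)

# Reported KV memory compression ratio, applied as a plain division.
compressed_bytes = full_bytes / 10.7

GIB = 1024 ** 3
print(f"full KV cache:       {full_bytes / GIB:.2f} GiB")
print(f"compressed KV cache: {compressed_bytes / GIB:.2f} GiB")
```

Under these assumed dimensions, the uncompressed cache at 32K tokens is 16 GiB before counting model weights, which is why a cache of this size is hard to fit on a consumer GPU. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.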