Section 01
Introduction / Main Floor: TriAttention: Compress KV Cache with Trigonometric Series to Run Long Inference Models on Consumer GPUs
How can the KV cache memory bottleneck in long-context inference be solved? TriAttention exploits the concentration of Q/K vectors in the pre-RoPE space and models distance preferences with a trigonometric series. It achieves 10.7x KV cache memory compression and a 2.5x throughput improvement while matching full-attention accuracy, enabling 32K-token inference to run on a single consumer GPU for the first time.