TriAttention: Compressing KV Cache with Trigonometric Series to Run Long-Inference Models on Consumer GPUs

How can the KV cache memory bottleneck in long-text inference be solved? TriAttention exploits the concentration of Q/K vectors in the pre-RoPE space and models distance preferences with a trigonometric series. It delivers 10.7x KV memory compression and a 2.5x throughput improvement with accuracy on par with full attention, enabling 32K-token inference on a single consumer GPU for the first time.

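Tags: KV cache compression, long-text inference, RoPE position encoding, attention mechanism optimization, LLM inference efficiency, memory optimization, Transformer architecture, large model deployment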
Published 2026-04-07 01:58 | Recent activity 2026-04-07 15:56 | Estimated read 5 min

Section 01

[Introduction] TriAttention: Compressing KV Cache with Trigonometric Series to Run Long-Inference Models on Consumer GPUs

Long-text inference is reshaping the capability boundary of large language models, but the rapid growth of KV cache memory has become a deployment bottleneck. By exploiting the concentration of Q/K vectors in the pre-RoPE space and modeling distance preferences with a trigonometric series, TriAttention achieves 10.7x KV memory compression and a 2.5x throughput improvement with accuracy on par with full attention, enabling 32K-token inference on a single consumer GPU for the first time.

Section 02

Memory Dilemma of Long Inference: Why KV Cache Becomes a Bottleneck

Modern LLM inference consists of a prefill stage and a decoding stage. During decoding, the KV cache grows linearly with sequence length: 32K-token inference requires dozens of GB of VRAM, which exceeds the capacity of consumer GPUs. Existing compression methods rely on post-RoPE attention scores, but RoPE rotates each query by a position-dependent angle, so query vectors disperse. This leads to sparse sampling, suboptimal key selection, and unstable inference.
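
To make the linear growth concrete, here is a minimal sketch of the usual KV cache size estimate. The model configuration (32 layers, 8 KV heads, head dimension 128, fp16) and the batch sizes are illustrative assumptions, not figures from the article; the actual footprint depends on the model and on how many sequences are decoded at once.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

one_seq = kv_cache_bytes(32 * 1024, **cfg)
eight_seqs = kv_cache_bytes(32 * 1024, batch=8, **cfg)

print(one_seq / 2**30)     # 4.0 GiB for a single 32K-token sequence
print(eight_seqs / 2**30)  # 32.0 GiB when 8 such sequences are decoded together
```

Under these assumptions the cache scales in direct proportion to both sequence length and batch size, which is why long generations quickly exceed consumer GPU memory.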

Section 03

Discovery in Pre-RoPE Space: Concentration Phenomenon of Q/K Vectors

The core insight of TriAttention comes from observing the pre-RoPE space: Q and K vectors are highly concentrated around fixed non-zero centers, and this distribution is stable across positions (the Q/K concentration phenomenon). Mathematical analysis shows that this property makes queries favor keys at specific distances, and that these distance preferences can be accurately characterized by a trigonometric series in which each center corresponds to a specific frequency component.
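
The link between fixed pre-RoPE vectors and a trigonometric series can be checked numerically. The sketch below uses the standard RoPE identity: for a fixed query q and key k, the post-RoPE dot product depends only on the distance between their positions and equals a sum of cosines, one per rotary frequency. This is a generic property of RoPE rather than the paper's own derivation; the function names, the interleaved pair layout, and the base of 10000 are assumptions for illustration.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to one even-length vector at position pos."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D pair
    z = x[0::2] + 1j * x[1::2]                 # treat each pair as a complex number
    z = z * np.exp(1j * pos * theta)           # rotate by a position-dependent angle
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out


def trig_series_logit(q, k, delta, base=10000.0):
    """Evaluate sum_i a_i * cos(delta * theta_i + phi_i) for pre-RoPE q and k."""
    d = q.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    c = (q[0::2] + 1j * q[1::2]) * np.conj(k[0::2] + 1j * k[1::2])
    # Amplitude a_i = |c_i| and phase phi_i = arg(c_i) come from q and k alone.
    return float(np.sum(np.abs(c) * np.cos(delta * theta + np.angle(c))))


rng = np.random.default_rng(0)
q = rng.normal(size=8)  # stand-in for a query center
k = rng.normal(size=8)  # stand-in for a key center
m, n = 100, 37          # query and key positions

direct = float(np.dot(rope(q, m), rope(k, n)))
series = trig_series_logit(q, k, m - n)
print(abs(direct - series) < 1e-9)
```

If queries and keys stay close to fixed centers, the amplitudes and phases are nearly constant, so the logit becomes approximately a function of distance alone.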

Section 04

Core Mechanism of TriAttention: Trigonometric Series Distance Modeling

Instead of relying on post-RoPE attention scores, TriAttention directly exploits the concentration of Q/K vectors in the pre-RoPE space: (1) identify the concentration centers, which encode the distance preference patterns; (2) decompose the centers with a trigonometric series to compute a distance preference score for each key; (3) combine these scores with Q/K norms to improve key selection accuracy. The scoring runs in constant time and adds no overhead that grows with sequence length, making it suitable for ultra-long inference.
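
The three steps above can be sketched roughly as follows. This is not the authors' implementation: the article does not say how centers are estimated or how norms enter the score, so the sketch assumes plain means as centers and a multiplicative key-norm weight. The function name select_keys, its parameters, and the toy data are hypothetical. It also rescores every key on each call, so it does not reproduce the constant-time behavior described above.

```python
import numpy as np


def pair_complex(x):
    """View interleaved (even, odd) channel pairs as complex numbers."""
    return x[..., 0::2] + 1j * x[..., 1::2]


def select_keys(q_pre, k_pre, budget, base=10000.0):
    """Pick `budget` key positions to keep, given pre-RoPE queries and keys.

    q_pre: (n_q, d) pre-RoPE queries; k_pre: (n_k, d) pre-RoPE keys at
    positions 0..n_k-1. Returns (sorted kept indices, per-key scores).
    """
    n_k, d = k_pre.shape
    theta = base ** (-np.arange(0, d, 2) / d)

    # Step 1 (assumption): estimate concentration centers as plain means.
    q_center = q_pre.mean(axis=0)
    k_center = k_pre.mean(axis=0)

    # Step 2: trigonometric series over the distance to the next query position.
    c = pair_complex(q_center) * np.conj(pair_complex(k_center))
    delta = n_k - np.arange(n_k)
    pref = np.sum(np.abs(c) * np.cos(delta[:, None] * theta + np.angle(c)), axis=1)

    # Step 3 (assumption): weight each key by its norm relative to the center.
    norm_w = np.linalg.norm(k_pre, axis=1) / (np.linalg.norm(k_center) + 1e-12)
    score = pref * norm_w

    keep = np.sort(np.argsort(score)[-budget:])  # highest-scoring keys, in order
    return keep, score


rng = np.random.default_rng(0)
q_pre = rng.normal(size=(16, 8)) + 2.0  # toy queries around a non-zero center
k_pre = rng.normal(size=(64, 8)) + 1.0  # toy keys around a non-zero center
keep, score = select_keys(q_pre, k_pre, budget=6)
print(keep)
```

Keeping 6 of 64 keys in this toy setup corresponds to a compression ratio of roughly 10.7x, mirroring the headline figure only as an illustration.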

Section 05

Experimental Validation: Dual Breakthroughs in Accuracy and Efficiency

On the AIME25 benchmark with 32K-token inference, TriAttention (1) achieves accuracy essentially equal to full attention, (2) compresses KV memory by 10.7x, and (3) improves throughput by 2.5x. At the same efficiency, it reaches about twice the accuracy of baseline methods, and for the first time it enables 32K-token inference on a single consumer GPU.

Section 06

Technical Insights and Future Outlook

TriAttention demonstrates the value of deeply understanding the internal mechanisms of Transformers, and it prompts a rethink of position encoding design, in particular of how much information the pre-RoPE space carries. In practice, it makes long-context LLMs more widely deployable and opens the possibility of running them on edge devices. The team plans to open-source the implementation and explore applications in scenarios such as multimodal long sequences and real-time dialogue.