# TriAttention: Compressing KV Cache with Trigonometric Series to Run Long-Inference Models on Consumer GPUs

> How can the KV cache memory bottleneck in long-text inference be solved? TriAttention exploits the concentration of Q/K vectors in the pre-RoPE space and models distance preferences with a trigonometric series. While matching full-attention accuracy, it reports 10.7x KV memory compression and a 2.5x throughput improvement, which allows 32K-token inference to run on a single consumer GPU for the first time.

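- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-06T17:58:42.000Z
- Last activity: 2026-04-07T07:56:46.057Z
- Popularity: 128.0
- Keywords: KV cache compression, long-text inference, RoPE positional encoding, attention mechanism optimization, LLM inference efficiency, memory optimization, Transformer architecture, large model deployment
- Page link: https://www.zingnex.cn/en/forum/thread/triattention-kv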
- Canonical: https://www.zingnex.cn/forum/thread/triattention-kv

---

## Introduction

Long-text inference extends what large language models can do, but the memory consumed by the KV cache has become a deployment bottleneck. TriAttention addresses this by exploiting the concentration of Q/K vectors in the pre-RoPE space and modeling distance preferences with a trigonometric series. The reported result is 10.7x KV memory compression and a 2.5x throughput improvement at full-attention accuracy, enough to run 32K-token inference on a single consumer GPU.

## Memory Dilemma of Long Inference: Why KV Cache Becomes a Bottleneck

Modern LLM inference consists of a prefill stage and a decoding stage. During decoding, the KV cache grows linearly with sequence length: 32K-token inference can require dozens of GB of VRAM, which exceeds the capacity of consumer GPUs. Existing compression methods estimate key importance from post-RoPE attention scores. Because RoPE rotates each query according to its position, the post-RoPE queries are dispersed and only a sparse sample of them is representative, which leads to suboptimal key selection and unstable inference.
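To make the linear growth concrete, here is a back-of-the-envelope sketch of the KV cache size. The model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical example chosen for illustration, not a model named in the post, and `kv_cache_bytes` is an illustrative helper rather than part of any library.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every token.

    The leading factor of 2 accounts for storing one K and one V vector
    per token, per layer, per KV head.
    """
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_value)


# Hypothetical configuration (an assumption, not taken from the post).
full = kv_cache_bytes(seq_len=32 * 1024, num_layers=32,
                      num_kv_heads=32, head_dim=128)
print(f"single sequence: {full / 2**30:.1f} GiB")   # 16.0 GiB

batched = kv_cache_bytes(seq_len=32 * 1024, num_layers=32,
                         num_kv_heads=32, head_dim=128, batch_size=4)
print(f"batch of 4:      {batched / 2**30:.1f} GiB")  # 64.0 GiB
```

Under these assumptions a single 32K-token sequence already needs 16 GiB for the cache alone, before model weights and activations, and batching multiplies that figure.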

## Discovery in Pre-RoPE Space: Concentration Phenomenon of Q/K Vectors

The core insight of TriAttention comes from observations in the pre-RoPE space: Q/K vectors are highly concentrated around fixed non-zero centers, and this distribution is stable across positions. The post calls this the Q/K concentration phenomenon. Mathematical analysis shows that this property makes queries favor keys at specific distances, and that these distance preferences can be accurately characterized by a trigonometric series in which each center corresponds to a specific frequency component.
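A short sketch of why concentration yields a trigonometric series, using the standard complex-number view of RoPE. The notation is introduced here for illustration and is not taken from the post: $q_j, k_j$ are the $j$-th two-dimensional pairs of the pre-RoPE query and key written as complex numbers, $\theta_j$ is the RoPE frequency of pair $j$, $m$ and $n$ are the query and key positions, and $c^{q}_j, c^{k}_j$ are the assumed concentration centers.

$$
\langle \mathrm{RoPE}(q, m),\ \mathrm{RoPE}(k, n) \rangle
= \sum_{j} \mathrm{Re}\!\left[ q_j\, \overline{k_j}\, e^{\,i (m - n)\theta_j} \right]
$$

If the pre-RoPE vectors stay close to their centers, $q_j \approx c^{q}_j$ and $k_j \approx c^{k}_j$, the logit depends, to first approximation, only on the distance $\Delta = m - n$:

$$
\sum_{j} \mathrm{Re}\!\left[ c^{q}_j\, \overline{c^{k}_j}\, e^{\,i \Delta \theta_j} \right]
= \sum_{j} \left| c^{q}_j \right| \left| c^{k}_j \right| \cos\!\left( \Delta\, \theta_j + \phi_j \right),
\qquad \phi_j = \arg\!\left( c^{q}_j\, \overline{c^{k}_j} \right)
$$

Each frequency $\theta_j$ contributes one cosine term whose amplitude and phase are set by the centers, which is the sense in which the centers determine the preferred distances.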

## Core Mechanism of TriAttention: Trigonometric Series Distance Modeling

TriAttention does not rely on post-RoPE attention scores. Instead, it uses the concentration of Q/K vectors in the pre-RoPE space directly:

1. Identify the concentration centers, which encode the distance preference pattern.
2. Expand the centers as a trigonometric series to compute a distance preference score for each key.
3. Combine this score with Q/K norms to improve key selection accuracy.

According to the post, this scoring runs in constant time and adds no overhead that grows with sequence length, which makes it suitable for very long inference.
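The three steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the centers are estimated as plain means, RoPE is assumed to use interleaved pairs with base 10000, and the way the key norm is folded into the score (a multiplicative weight) is a guess, since the post does not specify it. All function names are hypothetical.

```python
import numpy as np


def rope_frequencies(head_dim, base=10000.0):
    """Per-pair rotation frequencies theta_j of standard RoPE."""
    return base ** (-np.arange(head_dim // 2) * 2.0 / head_dim)


def to_complex(x):
    """View the last axis as head_dim/2 complex numbers (interleaved pairs)."""
    return x[..., 0::2] + 1j * x[..., 1::2]


def distance_preference(q_center, k_center, theta, distances):
    """Trigonometric series: sum_j Re(cq_j * conj(ck_j) * exp(i * d * theta_j))."""
    coeff = q_center * np.conj(k_center)               # shape (head_dim/2,)
    phase = np.exp(1j * np.outer(distances, theta))    # shape (n, head_dim/2)
    return (phase * coeff).real.sum(axis=-1)


def select_keys(pre_rope_q, pre_rope_k, query_pos, budget, base=10000.0):
    """Score cached keys by position and norm, then keep the top `budget`."""
    theta = rope_frequencies(pre_rope_k.shape[-1], base)

    # Step 1: concentration centers (assumption: simple means over tokens).
    q_center = to_complex(pre_rope_q).mean(axis=0)
    k_center = to_complex(pre_rope_k).mean(axis=0)

    # Step 2: distance preference for each key position.
    key_pos = np.arange(pre_rope_k.shape[0])
    pref = distance_preference(q_center, k_center, theta, query_pos - key_pos)

    # Step 3: fold in key norms (assumption: multiplicative weighting).
    norm_weight = np.linalg.norm(pre_rope_k, axis=-1)
    scores = pref * norm_weight / norm_weight.mean()

    keep = np.sort(np.argsort(scores)[-budget:])       # indices of retained keys
    return keep, scores


# Usage with synthetic vectors concentrated around a non-zero center.
rng = np.random.default_rng(0)
Q = rng.normal(loc=1.0, scale=0.1, size=(64, 8))
K = rng.normal(loc=1.0, scale=0.1, size=(64, 8))
keep, scores = select_keys(Q, K, query_pos=64, budget=16)
print(keep)
```

Note that the per-key score uses only the key's position, its norm, and the precomputed centers, so it needs no attention pass over the stored queries.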

## Experimental Validation: Dual Breakthroughs in Accuracy and Efficiency

On the AIME25 benchmark with 32K-token inference, TriAttention reports:

1. Accuracy essentially on par with full attention.
2. 10.7x KV memory compression.
3. A 2.5x throughput improvement.

At the same efficiency, baseline methods reach only about half of TriAttention's accuracy. The post also states that this is the first time 32K-token inference has run on a single consumer GPU.

## Technical Insights and Future Outlook

TriAttention shows the value of understanding the internal mechanisms of Transformers in depth, and it raises questions for positional encoding design, in particular how much usable information lives in the pre-RoPE space. In practice, it broadens access to long-context LLMs and makes running them on edge devices plausible. The team plans to open-source the implementation and to explore applications such as multimodal long sequences and real-time dialogue.
