Zing Forum


Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context Inference on Apple Silicon

Open-TQ-Metal is the first solution to implement fused compressed-domain attention on Apple Silicon, enabling Llama 3.1 70B with a 128K context to run on a single 64GB consumer Mac. Custom Metal compute shaders compute attention directly on int4-compressed representations, delivering a 48x attention speedup and 3.2x memory compression at 128K context.

Long-context inference · KV cache quantization · Apple Silicon · Attention mechanisms · On-device AI · Model compression · Consumer hardware
Published 2026-04-18 18:39 · Recent activity 2026-04-21 10:23 · Estimated read 5 min

Section 01

Open-TQ-Metal: A Groundbreaking Solution for Edge Long-Context Inference on Apple Silicon

Open-TQ-Metal is the first solution to implement fused compressed-domain attention on Apple Silicon, enabling Llama 3.1 70B with a 128K context to run on a single 64GB consumer Mac. By using custom Metal compute shaders to compute attention directly on int4-compressed representations, it achieves a 48x attention speedup and 3.2x memory compression, providing a feasible path for consumer devices to run long-context large models.


Section 02

Core Challenges of Running Long-Context Models on Consumer Hardware

The long-context capability of large language models has expanded to 128K and even millions of tokens, but it is often locked behind expensive data-center GPUs. Take Llama 3.1 70B as an example: at FP16 precision, the KV cache for a 128K context requires about 40GB of memory, exceeding the capacity of most consumer devices. Existing frameworks either do not support this configuration at all or fall back on memory swapping, causing inference speed to collapse.
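The ~40GB figure can be reproduced from Llama 3.1 70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with a quick back-of-the-envelope calculation:

```python
# KV cache size for Llama 3.1 70B at 128K context, FP16.
# Architecture numbers are from the published Llama 3.1 70B config:
# 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor of 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      seq_len=128 * 1024, bytes_per_elem=2)
print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB")  # 40.0 GiB
```

Grouped-query attention already shrinks the cache 8x versus full multi-head attention; even so, 40GB of cache plus the quantized weights overwhelms a 64GB machine without further compression.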


Section 03

Core Technology: An Innovative Paradigm of Fused Compressed-Domain Attention

The traditional KV cache quantization pipeline is quantize → dequantize → FP16 attention, which carries heavy overhead. Open-TQ-Metal's innovations:

1. Quantize the KV cache to int4 on the fly as tokens are generated.
2. Compute attention directly in the int4 compressed domain using custom Metal shaders, with no dequantization step.
3. Eliminate the dequantized FP16 intermediate matrices, saving both memory and data-movement overhead.
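The core trick is that per-row quantization scales can be folded into the score and output math, so attention never materializes an FP16 K or V matrix. A minimal NumPy sketch of the idea (the real kernels are Metal compute shaders; the function names and the symmetric per-row quantization scheme here are illustrative assumptions, not the project's exact design):

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row int4 quantization: codes in [-7, 7] plus one scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def sdpa_int4(q, k_codes, k_scale, v_codes, v_scale):
    """Attention computed directly on int4 codes: the FP16 K/V matrices are
    never rebuilt; the per-row scales are folded into scores and weights."""
    d = q.shape[-1]
    # q @ K^T with K ≈ codes * scale:  (q @ codes^T) * scale^T
    scores = (q @ k_codes.T.astype(np.float32)) * k_scale.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Same trick for V: fold each row's scale into its attention weight.
    return (w * v_scale.T) @ v_codes.astype(np.float32)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64)).astype(np.float32)
k = rng.standard_normal((256, 64)).astype(np.float32)
v = rng.standard_normal((256, 64)).astype(np.float32)

out = sdpa_int4(q, *quantize_int4(k), *quantize_int4(v))
```

In the fused Metal kernel this scale-folding happens inside a single shader dispatch, so the int4 codes stream straight from memory into the dot products, which is where both the bandwidth savings and the speedup come from.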


Section 04

Performance Verification: Measured Results of 48x Acceleration and 3.2x Memory Compression

330 experiments covering Gemma 4 31B and Llama 3.1 70B show that, at 128K context:

- The fused sdpa_int4 kernel delivers a 48x attention speedup, with top-1 predictions matching FP16.
- The KV cache shrinks from 40GB to 12.5GB, a 3.2x compression ratio.
- This is the first setup to run Llama 3.1 70B with a 128K context on a single 64GB Mac.
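The 3.2x figure (rather than the ideal 4x of 16-bit → 4-bit) implies an effective 5 bits per cached element, which is consistent with 4-bit codes plus per-group scale overhead. For instance, one fp16 scale per 16-element group lands exactly on 3.2x (the group size here is an assumption for illustration, not stated by the project):

```python
# Sanity check on the reported 3.2x compression: 40 GB / 12.5 GB = 3.2x
# means an effective 16 / 3.2 = 5 bits per element.
code_bits = 4        # int4 payload
scale_bits = 16      # one fp16 scale shared by each quantization group
group_size = 16      # hypothetical group size chosen to match the ratio

effective_bits = code_bits + scale_bits / group_size
ratio = 16 / effective_bits
print(effective_bits, ratio)  # 5.0 3.2
```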


Section 05

Cross-Architecture Quantization Insight: The Key Role of Attention Scaling Factor

Open-TQ-Metal is the first to systematically analyze KV cache quantization across architectures, finding that the attention scaling factor decides whether angular quantization schemes (such as PolarQuant) succeed or fail: with Gemma 4's attn_scale=1.0, directional error is amplified 25-100x, whereas Llama's 1/sqrt(d) scaling keeps errors under control. This explains why the same quantization scheme performs so differently across models.
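The direction of the effect is easy to reproduce in a toy NumPy experiment (an illustrative sketch under assumed random inputs, not the project's benchmark and not a reproduction of the exact 25-100x figure): the same key-quantization noise perturbs the softmax attention weights far more when the logits are left unscaled, because larger logits sit on the steep part of the softmax.

```python
import numpy as np

rng = np.random.default_rng(42)
d, n, trials = 128, 512, 64

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weight_error(attn_scale):
    """Mean L1 distance between attention weights computed from exact keys
    vs int4-quantized keys, under a given logit scaling factor."""
    errs = []
    for _ in range(trials):
        q = rng.standard_normal(d)
        k = rng.standard_normal((n, d))
        # Crude symmetric per-row int4 quantization (illustrative,
        # not PolarQuant's angular scheme).
        s = np.abs(k).max(axis=1, keepdims=True) / 7.0
        k_q = np.clip(np.round(k / s), -7, 7) * s
        w = softmax((k @ q) * attn_scale)
        w_q = softmax((k_q @ q) * attn_scale)
        errs.append(np.abs(w - w_q).sum())
    return float(np.mean(errs))

err_unscaled = weight_error(1.0)             # Gemma-style attn_scale = 1.0
err_scaled = weight_error(1.0 / np.sqrt(d))  # Llama-style 1/sqrt(d)
print(err_unscaled, err_scaled)
```

With attn_scale=1.0 the logits have standard deviation around sqrt(d), so a fixed amount of quantization noise can flip which key dominates the softmax; with 1/sqrt(d) scaling the same noise is attenuated before the exponential.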


Section 06

Application Prospects: Edge AI Democratization and Open-Source Ecosystem Contributions

Open-TQ-Metal lowers the barrier to using long-context models: developers can prototype applications on a local Mac and process long documents or codebases without cloud services, keeping data private. The open-source release contributes Apple Silicon optimization techniques, worked examples of compressed-domain attention, and a cross-model quantization methodology, and the same principle can be extended to mobile edge devices.


Section 07

Limitations and Future: Expanding Platforms and More Aggressive Quantization Exploration

Current limitations: Apple Silicon only, coverage focused on the Gemma/Llama families, and int4 as the main precision. Future directions: port to Qualcomm/MediaTek and other hardware, explore more aggressive quantization such as int2, and extend compressed-domain computation to operators beyond attention.