Zing Forum


Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context Inference on Apple Silicon

Open-TQ-Metal is the first solution to implement fused compressed-domain attention on Apple Silicon, enabling Llama 3.1 70B with a 128K context to run on a single 64GB consumer Mac. Custom Metal compute shaders compute attention directly on int4-compressed representations, delivering a 48x attention speedup and 3.2x memory compression at 128K context.

Long-context inference · KV cache quantization · Apple Silicon · Attention mechanisms · On-device AI · Model compression · Consumer hardware
Published 2026-04-18 18:39 · Recent activity 2026-04-21 10:23 · Estimated read 5 min

Section 01

Open-TQ-Metal: A Groundbreaking Solution for Edge Long-Context Inference on Apple Silicon

Open-TQ-Metal is the first solution to implement fused compressed-domain attention on Apple Silicon, enabling Llama 3.1 70B with a 128K context to run on a single 64GB consumer Mac. By using custom Metal compute shaders to compute attention directly on int4-compressed representations, it achieves a 48x attention speedup and 3.2x memory compression, providing a feasible path for consumer devices to run long-context large models.


Section 02

Core Challenges of Running Long-Context Models on Consumer Hardware

The long-context capability of large language models has expanded to 128K and even millions of tokens, but it is often locked behind expensive data-center GPUs. Take Llama 3.1 70B as an example: at FP16 precision, the KV cache for a 128K context requires about 40GB of memory, exceeding the capacity of most consumer devices. Existing frameworks either do not support this configuration at all or fall back on memory swapping, causing inference speed to collapse.
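The ~40GB figure can be reproduced from Llama 3.1 70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with a quick back-of-the-envelope calculation:

```python
# KV cache size for Llama 3.1 70B at 128K context, FP16.
# Architecture numbers are from the published Llama 3.1 70B config:
# 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor of 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      seq_len=128 * 1024, bytes_per_elem=2)
print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB")  # 40.0 GiB
```

Grouped-query attention already shrinks the cache 8x versus full multi-head attention; even so, 40GB of cache plus the quantized weights overwhelms a 64GB machine without further compression.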


Section 03

Core Technology: An Innovative Paradigm of Fused Compressed-Domain Attention

The traditional KV cache quantization pipeline is quantize → dequantize → FP16 attention, which carries heavy overhead. Open-TQ-Metal's innovations:

1. Quantize the KV cache to int4 on the fly as tokens are generated.
2. Compute attention directly in the int4 compressed domain using custom Metal shaders, with no dequantization step.
3. Eliminate the dequantized FP16 intermediate matrices, saving both memory and data-movement overhead.
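The core trick is that per-row quantization scales can be folded into the score and output math, so attention never materializes an FP16 K or V matrix. A minimal NumPy sketch of the idea (the real kernels are Metal compute shaders; the function names and the symmetric per-row quantization scheme here are illustrative assumptions, not the project's exact design):

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row int4 quantization: codes in [-7, 7] plus one scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def sdpa_int4(q, k_codes, k_scale, v_codes, v_scale):
    """Attention computed directly on int4 codes: the FP16 K/V matrices are
    never rebuilt; the per-row scales are folded into scores and weights."""
    d = q.shape[-1]
    # q @ K^T with K ≈ codes * scale:  (q @ codes^T) * scale^T
    scores = (q @ k_codes.T.astype(np.float32)) * k_scale.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Same trick for V: fold each row's scale into its attention weight.
    return (w * v_scale.T) @ v_codes.astype(np.float32)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64)).astype(np.float32)
k = rng.standard_normal((256, 64)).astype(np.float32)
v = rng.standard_normal((256, 64)).astype(np.float32)

out = sdpa_int4(q, *quantize_int4(k), *quantize_int4(v))
```

In the fused Metal kernel this scale-folding happens inside a single shader dispatch, so the int4 codes stream straight from memory into the dot products, which is where both the bandwidth savings and the speedup come from.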


Section 04

Performance Verification: Measured Results of 48x Acceleration and 3.2x Memory Compression

330 experiments covering Gemma 4 31B and Llama 3.1 70B show that, at 128K context:

- The fused sdpa_int4 kernel delivers a 48x attention speedup, with top-1 predictions matching FP16.
- The KV cache shrinks from 40GB to 12.5GB, a 3.2x compression ratio.
- This is the first setup to run Llama 3.1 70B with a 128K context on a single 64GB Mac.
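The 3.2x figure (rather than the ideal 4x of 16-bit → 4-bit) implies an effective 5 bits per cached element, which is consistent with 4-bit codes plus per-group scale overhead. For instance, one fp16 scale per 16-element group lands exactly on 3.2x (the group size here is an assumption for illustration, not stated by the project):

```python
# Sanity check on the reported 3.2x compression: 40 GB / 12.5 GB = 3.2x
# means an effective 16 / 3.2 = 5 bits per element.
code_bits = 4        # int4 payload
scale_bits = 16      # one fp16 scale shared by each quantization group
group_size = 16      # hypothetical group size chosen to match the ratio

effective_bits = code_bits + scale_bits / group_size
ratio = 16 / effective_bits
print(effective_bits, ratio)  # 5.0 3.2
```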


Section 05

Cross-Architecture Quantization Insight: The Key Role of Attention Scaling Factor

Open-TQ-Metal is the first to systematically analyze KV cache quantization across architectures, finding that the attention scaling factor decides whether angular quantization schemes (such as PolarQuant) succeed or fail: with Gemma 4's attn_scale=1.0, directional error is amplified 25-100x, whereas Llama's 1/sqrt(d) scaling keeps errors under control. This explains why the same quantization scheme performs so differently across models.
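The direction of the effect is easy to reproduce in a toy NumPy experiment (an illustrative sketch under assumed random inputs, not the project's benchmark and not a reproduction of the exact 25-100x figure): the same key-quantization noise perturbs the softmax attention weights far more when the logits are left unscaled, because larger logits sit on the steep part of the softmax.

```python
import numpy as np

rng = np.random.default_rng(42)
d, n, trials = 128, 512, 64

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def weight_error(attn_scale):
    """Mean L1 distance between attention weights computed from exact keys
    vs int4-quantized keys, under a given logit scaling factor."""
    errs = []
    for _ in range(trials):
        q = rng.standard_normal(d)
        k = rng.standard_normal((n, d))
        # Crude symmetric per-row int4 quantization (illustrative,
        # not PolarQuant's angular scheme).
        s = np.abs(k).max(axis=1, keepdims=True) / 7.0
        k_q = np.clip(np.round(k / s), -7, 7) * s
        w = softmax((k @ q) * attn_scale)
        w_q = softmax((k_q @ q) * attn_scale)
        errs.append(np.abs(w - w_q).sum())
    return float(np.mean(errs))

err_unscaled = weight_error(1.0)             # Gemma-style attn_scale = 1.0
err_scaled = weight_error(1.0 / np.sqrt(d))  # Llama-style 1/sqrt(d)
print(err_unscaled, err_scaled)
```

With attn_scale=1.0 the logits have standard deviation around sqrt(d), so a fixed amount of quantization noise can flip which key dominates the softmax; with 1/sqrt(d) scaling the same noise is attenuated before the exponential.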


Section 06

Application Prospects: Edge AI Democratization and Open-Source Ecosystem Contributions

Open-TQ-Metal lowers the barrier to using long-context models: developers can prototype applications on a local Mac and process long documents or codebases without cloud services, keeping data private. The open-source release contributes Apple Silicon optimization techniques, worked examples of compressed-domain attention, and a cross-model quantization methodology, and the same principle can be extended to mobile edge devices.


Section 07

Limitations and Future: Expanding Platforms and More Aggressive Quantization Exploration

Current limitations: Apple Silicon only, coverage focused on the Gemma/Llama families, and int4 as the main precision. Future directions: port to Qualcomm/MediaTek and other hardware, explore more aggressive quantization such as int2, and extend compressed-domain computation to operators beyond attention.