Section 01
Open-TQ-Metal: A Groundbreaking Solution for Edge Long-Context Inference on Apple Silicon
Open-TQ-Metal is the first implementation of fused compressed-domain attention on Apple Silicon. It enables Llama 3.1 70B with a 128K-token context to run on a single consumer Mac with 64 GB of memory. Custom Metal compute shaders compute attention directly on int4-compressed representations rather than decompressing them first, yielding a 48x attention speedup and 3.2x memory compression. This gives consumer devices a feasible path to running large models with long contexts.
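To make "attention computed directly on int4 compressed representations" concrete, here is a minimal Python sketch of the arithmetic only. It is not Open-TQ-Metal's Metal shader code or its actual quantization scheme. It assumes symmetric per-vector int4 quantization (integer codes in [-7, 7] plus one floating-point scale), and the names `quantize_int4`, `attention_compressed`, and `attention_reference` are hypothetical, chosen for illustration.

```python
import math

def quantize_int4(vec):
    """Assumed scheme: symmetric per-vector int4, codes in [-7, 7] plus one scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(x / scale))) for x in vec]
    return codes, scale

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_compressed(q, k_quant, v_quant):
    """Single-query attention over quantized keys and values.

    Dot products run on the integer codes; each scale is applied once to a
    scalar, so no dequantized key or value vector is ever built.
    """
    d = len(q)
    scores = []
    for codes, scale in k_quant:
        dot = sum(qi * ci for qi, ci in zip(q, codes))
        scores.append(dot * scale / math.sqrt(d))
    weights = softmax(scores)
    out = [0.0] * len(v_quant[0][0])
    for w, (codes, scale) in zip(weights, v_quant):
        ws = w * scale  # fold the value scale into the attention weight
        for j, c in enumerate(codes):
            out[j] += ws * c
    return out

def attention_reference(q, keys, values):
    """Plain floating-point attention, used only for comparison."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Toy data: one query, three key/value pairs.
q = [0.5, -1.0, 0.25, 2.0]
keys = [[1.0, 0.0, -0.5, 0.7], [0.2, -0.3, 0.9, -1.4], [-0.6, 0.8, 0.1, 0.3]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

k_quant = [quantize_int4(k) for k in keys]
v_quant = [quantize_int4(v) for v in values]
out = attention_compressed(q, k_quant, v_quant)
```

The compressed-domain result equals, up to floating-point rounding, what ordinary attention produces on the dequantized keys and values. What a fused GPU kernel would add is performing these steps in a single pass, so the dequantized tensors are never written to memory.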