Zing Forum


IntAttention: A Pure Integer Attention Inference Acceleration Scheme for Edge Devices

Open-source implementation of an MLSys 2026 paper, enabling high-fidelity, high-speed inference of Large Language Models and Vision Transformers on ARM CPUs via an all-integer attention pipeline.

Tags: IntAttention · Integer Quantization · Edge Inference · Transformer Optimization · ARM CPU · MLSys 2026 · Attention Mechanism · Model Deployment
Published 2026-04-20 03:14 · Recent activity 2026-04-20 03:20 · Estimated read 5 min

Section 01

[Overview] IntAttention: A Pure Integer Attention Inference Acceleration Scheme for Edge Devices

IntAttention is the open-source implementation of an MLSys 2026 paper. It proposes an all-integer attention pipeline to enable high-fidelity and high-speed inference of Large Language Models (LLMs) and Vision Transformers (ViTs) on ARM CPUs, aiming to address the computational power bottleneck of deploying Transformer models on edge devices.


Section 02

Background: Computational Power and Attention Quantization Challenges in Edge AI

As LLMs and ViTs become widespread, deploying them on edge devices runs into high floating-point computation overhead, high latency, and high energy consumption. Quantization can mitigate these costs, but existing solutions often leave the attention mechanism's more complex operations untouched: the matrix multiplications and Softmax inside attention are prone to precision loss and numerical overflow under integer quantization. Balancing precision and efficiency there remains an open problem.
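To make the precision/overflow trade-off concrete, here is a minimal sketch of standard affine (asymmetric) int8 quantization, the scheme most frameworks use for activations. This is generic illustrative code, not IntAttention's actual calibration routine; all names are hypothetical.

```python
import numpy as np

def quant_params(x, num_bits=8):
    """Pick scale and zero point so the observed range maps onto [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zp):
    # q = round(x / scale) + zero_point, clipped to the uint8 range
    return np.clip(np.round(x / scale) + zp, 0, 255).astype(np.uint8)

def dequantize(q, scale, zp):
    return (q.astype(np.int32) - zp) * scale

x = np.array([-1.2, 0.0, 0.7, 2.5], dtype=np.float32)
s, zp = quant_params(x)
x_hat = dequantize(quantize(x, s, zp), s, zp)
# Round-trip error is bounded by scale / 2 per element.
```

The round-trip error is at most half the scale per element; the difficulty the paper targets is that attention's exponentiation and large dot-product accumulations stress exactly this error bound and the integer dynamic range.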


Section 03

Core Innovations: All-Integer Attention Pipeline and Key Optimizations

The core of IntAttention is an all-integer attention pipeline covering the entire Query-Key dot product, Softmax normalization, and Attention-Value multiplication. Key optimizations include: 1. Integer Softmax, replacing floating-point exponentiation and division with Look-Up Tables (LUTs) and fixed-point arithmetic; 2. Layer-wise dynamic quantization, adjusting scaling factors and zero points to each layer's activation distribution; 3. Blocked memory layout optimization, improving cache hit rates.
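The first optimization above can be sketched as follows: an integer-only softmax that looks up precomputed exponentials in a fixed-point LUT and normalizes with integer division. This is a simplified illustration under assumed parameters (a Q16 fixed-point format, a 256-entry LUT, a logit step of 0.05), not the paper's actual kernel.

```python
import numpy as np

FRAC_BITS = 16  # Q16 fixed point for intermediate values (assumption)

def build_exp_lut(num_entries=256, scale=0.05):
    """Precompute exp(-i * scale) for i = 0..N-1, stored in Q16 fixed point.

    After subtracting the row max, every softmax input is <= 0, so only
    negative offsets are needed.
    """
    offsets = -np.arange(num_entries) * scale
    return np.round(np.exp(offsets) * (1 << FRAC_BITS)).astype(np.int64)

def int_softmax(q_logits, lut):
    """All-integer softmax over a 1-D row of quantized logits."""
    q = q_logits.astype(np.int64)
    idx = np.clip(q.max() - q, 0, len(lut) - 1)  # integer offset from row max
    exp_fixed = lut[idx]                         # Q16 exponentials via the LUT
    denom = exp_fixed.sum()
    # Probabilities in Q16 fixed point; their sum is ~(1 << FRAC_BITS).
    return (exp_fixed << FRAC_BITS) // denom

lut = build_exp_lut()
probs = int_softmax(np.array([10, 30, 50, 40]), lut)
```

No floating-point operation appears on the inference path: the `exp` calls run once at table-build time, and the per-token cost is a table lookup, an integer sum, and an integer division.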


Section 04

Experimental Results: Win-Win of Speed and Precision

Tests were conducted on models including LLaMA, BERT, and ViT, running on ARM CPUs such as Qualcomm Snapdragon and Apple M-series chips. Compared with floating-point baselines, inference speed increased by 2-4x and memory usage fell by roughly 50%; on benchmarks such as GLUE and ImageNet, accuracy differed from the floating-point models by less than 1%.


Section 05

Application Scenarios: Mobile Intelligent Assistants, Real-Time Visual Understanding, etc.

IntAttention can be applied to: 1. Mobile intelligent assistants, running LLMs locally for privacy protection and low latency; 2. Real-time visual understanding, running ViTs on camera-side devices for security monitoring and driving assistance; 3. IoT devices, running Transformer models on embedded hardware to upgrade smart homes and industrial inspection.


Section 06

Open-Source Ecosystem: Open Code, Support for Multi-Platforms and Model Conversion

IntAttention's code is fully open-source. It provides model conversion tools for PyTorch and ONNX formats and ships optimized kernels for ARM NEON and x86 AVX2; the official team provides tutorials, pre-trained models, and an end-to-end deployment workflow, and the community is actively extending support to multimodal models.


Section 07

Technical Outlook: Hardware-Aware Optimization and Multi-Platform Expansion

IntAttention represents the hardware-aware direction of edge AI inference optimization. Future work aims to extend it to platforms such as RISC-V and NPUs, and to combine it with sparsification and pruning techniques to further unlock the AI potential of edge devices.