# IntAttention: A Pure Integer Attention Inference Acceleration Scheme for Edge Devices

> Open-source implementation of an MLSys 2026 paper, enabling high-fidelity and high-speed inference of large models and Vision Transformers on ARM CPUs via an all-integer attention pipeline.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-19T19:14:13.000Z
- Last activity: 2026-04-19T19:20:17.638Z
- Popularity: 150.9
- Keywords: IntAttention, integer quantization, edge inference, Transformer optimization, ARM CPU, MLSys 2026, attention mechanism, model deployment
- Page link: https://www.zingnex.cn/en/forum/thread/intattention
- Canonical: https://www.zingnex.cn/forum/thread/intattention
- Markdown source: floors_fallback

---

## Overview

IntAttention is the open-source implementation of an MLSys 2026 paper. It proposes an all-integer attention pipeline that enables high-fidelity, high-speed inference of Large Language Models (LLMs) and Vision Transformers (ViTs) on ARM CPUs, addressing the compute bottleneck of deploying Transformer models on edge devices.

## Background: Compute Constraints and Attention Quantization Challenges in Edge AI

As LLMs and ViTs spread to edge devices, deployment faces high floating-point compute overhead, high latency, and high energy consumption. Quantization can mitigate these costs, but existing schemes often leave the attention mechanism's more complex operations untouched: under integer quantization, the matrix multiplications and the Softmax inside attention are prone to precision loss and numerical overflow. Balancing precision and efficiency there remains an open problem.
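As a toy illustration of why naive integer Softmax is fragile (the numbers below are this sketch's own, not the paper's): direct integer exponentiation overflows almost immediately, and rounding scores to int8 already shifts the Softmax output relative to a float reference.

```python
import numpy as np

# Toy illustration (not from the paper): two failure modes of naive
# integer Softmax.

# 1) Overflow: e^x for even a mid-range integer score blows past int64.
q_logit = 64                                  # a plausible int8 score
assert np.exp(np.float64(q_logit)) > np.iinfo(np.int64).max  # e^64 ~ 6e27

# 2) Precision loss: rounding scores to int8 before Softmax shifts the
#    probabilities relative to the float reference.
scores = np.array([2.10, 2.05, -1.00], dtype=np.float32)
scale = 127.0 / np.abs(scores).max()          # symmetric per-tensor scale
q = np.round(scores * scale) / scale          # quantize-dequantize round trip
ref = np.exp(scores - scores.max()); ref /= ref.sum()
deq = np.exp(q - q.max()); deq /= deq.sum()
print(np.abs(ref - deq).max())                # small but nonzero drift
```

This is why integer Softmax designs subtract the row maximum first (bounding every exponent at or below zero) and tabulate the exponential rather than computing it.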

## Core Innovations: All-Integer Attention Pipeline and Key Optimizations

The core of IntAttention is an all-integer attention pipeline covering the entire process: the Query-Key dot product, Softmax normalization, and Attention-Value multiplication. Key optimizations:

1. Integer Softmax: replaces floating-point exponentiation and division with look-up tables (LUTs) and fixed-point arithmetic.
2. Layer-wise dynamic quantization: adjusts scaling factors and zero points to each layer's activation distribution.
3. Blocked memory layout: improves cache hit rate.
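The integer-Softmax idea can be sketched as follows. The table size, bit widths, and score scale here are this sketch's assumptions, not the paper's actual parameters: after max-subtraction every exponent is non-positive, so e^x fits in a small fixed-point table and the whole computation stays in integer arithmetic.

```python
import numpy as np

# Hedged sketch of an LUT + fixed-point integer Softmax (optimization 1).
# FRAC_BITS, SCORE_FRAC, and LUT_SIZE are illustrative choices.
FRAC_BITS = 16                 # Q16 fixed-point for exponentials and outputs
SCORE_FRAC = 16                # assume integer scores have 1/16-unit resolution
LUT_SIZE = 256                 # covers shifted scores in [0, 255]

# Precompute e^(-t / SCORE_FRAC) as Q16 integers. After max-subtraction all
# exponents are <= 0, so every table entry fits in [0, 2^16].
EXP_LUT = np.round(
    np.exp(-np.arange(LUT_SIZE) / SCORE_FRAC) * (1 << FRAC_BITS)
).astype(np.int64)

def int_softmax(q_scores: np.ndarray) -> np.ndarray:
    """Softmax over integer attention scores using only integer ops."""
    shifted = q_scores.max() - q_scores      # non-negative LUT indices
    idx = np.minimum(shifted, LUT_SIZE - 1)  # clamp the far tail to the last entry
    num = EXP_LUT[idx]                       # Q16 exponentials, no float exp
    den = num.sum()
    return (num << FRAC_BITS) // den         # Q16 probabilities, no float division

probs = int_softmax(np.array([40, 38, 10], dtype=np.int64))
print(probs / (1 << FRAC_BITS))              # roughly [0.49, 0.43, 0.08]
```

The returned Q16 probabilities sum to approximately 2^16, so the downstream Attention-Value multiplication can also stay in integer arithmetic with a single rescale at the end.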

## Experimental Results: Gains in Both Speed and Precision

Experiments covered models including LLaMA, BERT, and ViT on ARM CPUs such as Qualcomm Snapdragon and Apple M-series chips. Compared with floating-point baselines, inference speed improved by 2-4x and memory usage dropped by roughly 50%; on benchmarks such as GLUE and ImageNet, accuracy stayed within 1% of the floating-point models.

## Application Scenarios: Mobile Intelligent Assistants, Real-Time Visual Understanding, etc.

IntAttention targets scenarios such as:

1. Mobile intelligent assistants: running LLMs locally for privacy protection and low latency.
2. Real-time visual understanding: running ViTs on camera endpoints for security and autonomous-driving assistance.
3. IoT devices: running Transformer models on embedded hardware to upgrade smart homes and industrial inspection.

## Open-Source Ecosystem: Open Code, Support for Multi-Platforms and Model Conversion

IntAttention's code is fully open source. It ships model-conversion tools for PyTorch and ONNX formats along with optimized kernels for ARM NEON and x86 AVX2. The official team provides tutorials, pre-trained models, and a complete deployment workflow, and the community is actively extending support to multimodal models.

## Technical Outlook: Hardware-Aware Optimization and Multi-Platform Expansion

IntAttention represents the hardware-aware direction of edge-AI inference optimization. Future work will extend it to platforms such as RISC-V and NPUs and combine it with sparsification and pruning techniques to further unlock the AI potential of edge devices.
