# MiniMax Sparse Attention: A New Efficient Attention Paradigm for Million-Scale Long Context

> MiniMax proposes the MSA sparse attention mechanism, which dynamically selects key KV blocks via a lightweight indexing branch. On a 109B parameter model, it achieves a 28.4x reduction in computational load while maintaining performance comparable to GQA.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-11T14:23:41.000Z
- Last activity: 2026-06-12T01:19:47.098Z
- Popularity: 111.1
- Keywords: sparse attention, long context, large language model, MiniMax, GQA, inference acceleration, GPU optimization
- Page link: https://www.zingnex.cn/en/forum/thread/minimax-sparse-attention
- Canonical: https://www.zingnex.cn/forum/thread/minimax-sparse-attention
- Markdown source: floors_fallback

---

## [Introduction] MiniMax Sparse Attention: A New Efficient Attention Paradigm for Million-Scale Long Context

### Key Information
- **Mechanism**: MiniMax proposes the MSA sparse attention mechanism, which dynamically selects key KV blocks via a lightweight indexing branch
- **Effect**: On a 109B parameter model, it achieves a 28.4x reduction in computational load, with performance comparable to GQA
- **Source**: Published on arXiv on June 11, 2026 by Xunhao Lai et al. (MiniMax team). Open-source code and models are available at https://github.com/MiniMax-AI/MSA and https://huggingface.co/MiniMaxAI/MiniMax-M3
- **Keywords**: Sparse attention, long context, large language model, MiniMax, GQA, inference acceleration, GPU optimization

This post examines the work from four angles: background, architecture, optimization, and experiments.
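To make the idea of "selecting key KV blocks" concrete, here is a minimal pure-Python sketch of generic block-level top-k sparse attention for a single query. It is an illustration under stated assumptions, not MSA's actual design: the post does not describe how the indexing branch scores blocks, so the mean-pooled key used as a block summary, and the names `select_kv_blocks`, `sparse_attention`, `block_size`, and `top_k`, are all hypothetical.

```python
import math


def select_kv_blocks(query, keys, block_size, top_k):
    """Return the indices of the top_k KV blocks for one query.

    ASSUMPTION: each block is summarized by the mean of its keys and scored
    by a dot product with the query. The real MSA indexing branch is not
    described in the post and may work differently.
    """
    dim = len(query)
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scores.append(sum(q * m for q, m in zip(query, mean_key)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the tokens of the selected blocks."""
    blocks = select_kv_blocks(query, keys, block_size, top_k)
    token_ids = [
        i
        for b in blocks
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in token_ids]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]


# Toy example: 4 tokens in 2 blocks; the query matches the second block.
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
print(select_kv_blocks([1.0, 0.0], keys, block_size=2, top_k=1))
print(sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1))
```

In this sketch the full attention step only touches `top_k * block_size` tokens per query, while the selection step touches one summary vector per block. That split is where the savings of block-sparse schemes generally come from.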

## Original Authors and Source

- **Original Authors/Team**: Xunhao Lai, Weiqi Xu, Yufeng Yang et al. (MiniMax and collaborating institutions)
- **Source Platform**: arXiv
- **Original Title**: MiniMax Sparse Attention
- **Original Link**: https://arxiv.org/abs/2606.13392
- **Publication Time**: June 11, 2026
- **Open-Source Code**: https://github.com/MiniMax-AI/MSA
- **Model Release**: https://huggingface.co/MiniMaxAI/MiniMax-M3

---

## Long Context Becomes a New Battlefield for Large Models

Large language models are undergoing a major shift in how they are used. Where early models handled short, single-turn conversations, today's workloads include agent workflows with hundreds of interaction steps, repository-level code reasoning, and persistent memory systems, all of which require the model to attend to hundreds of thousands or even millions of tokens at once. Ultra-long context has become a core capability of frontier models.

However, standard softmax attention has a fundamental bottleneck: its computational cost grows with the square of the sequence length. At million-token context lengths, compute and memory costs balloon to the point where practical deployment becomes prohibitively expensive. Breaking through this efficiency bottleneck without sacrificing model quality has become a shared focus of academia and industry.
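The quadratic growth is easy to see with a little arithmetic. The sketch below counts the query-key score entries that full attention evaluates per head, assuming one score per (query, key) pair and ignoring causal masking, which roughly halves the count but does not change the scaling. The function names and the sample lengths are illustrative choices, not figures from the paper.

```python
def attention_score_entries(seq_len: int) -> int:
    """Query-key score entries per head for full, unmasked attention."""
    return seq_len * seq_len


def growth_factor(short_len: int, long_len: int) -> float:
    """How many times more score entries the longer context needs."""
    return attention_score_entries(long_len) / attention_score_entries(short_len)


for n in (8_192, 131_072, 1_048_576):
    print(f"{n:>9} tokens -> {attention_score_entries(n):.3e} score entries per head")

# A context that is 128x longer costs 128**2 times more attention compute.
print(growth_factor(8_192, 1_048_576))
```

Fused kernels can avoid materializing the full score matrix in memory, but the number of score computations still scales quadratically, which is why sparse schemes aim to reduce how many keys each query attends to.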

---