Section 01
MiniMax Sparse Attention Mechanism: Guide to Efficient Inference for Million-Scale Long Contexts
MiniMax Sparse Attention (MSA), proposed by the MiniMax team, addresses the quadratic cost in sequence length of standard softmax attention. It applies a block-level sparse design on top of Grouped Query Attention (GQA), and on a 109B-parameter model it achieves a 28.4x reduction in computation at million-scale context lengths. Combined with GPU kernel optimizations, it delivers 14.2x faster prefill and 7.6x faster decoding while keeping model performance comparable to the original GQA, which makes it a practical option for deploying large models with ultra-long contexts.
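To make the idea of block-level sparsity over a GQA group concrete, here is a minimal NumPy sketch of a single decode step. It is an illustration under stated assumptions, not the MSA algorithm itself: the block scoring rule (mean-pooled keys), the top-k selection, and the names block_sparse_gqa_decode, dense_attention, block_size, and top_k are all hypothetical choices made for this example.

```python
import numpy as np


def dense_attention(q, k, v):
    """Ordinary softmax attention, used as the reference."""
    d = k.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def block_sparse_gqa_decode(q, k, v, block_size, top_k):
    """One decode step for a single GQA group (illustrative sketch).

    q: (group_heads, d) queries that share one KV head
    k, v: (seq_len, d) keys and values of that shared KV head
    Returns the attention output and the selected block indices.
    """
    seq_len, d = k.shape
    assert seq_len % block_size == 0, "sketch assumes whole blocks"
    n_blocks = seq_len // block_size
    top_k = min(top_k, n_blocks)

    # Assumption: summarize each key block by its mean vector.
    k_blocks = k.reshape(n_blocks, block_size, d)
    v_blocks = v.reshape(n_blocks, block_size, d)
    block_summary = k_blocks.mean(axis=1)  # (n_blocks, d)

    # Score blocks per head, then sum over the group so that all
    # heads sharing this KV head read the same blocks.
    block_scores = (q @ block_summary.T).sum(axis=0)  # (n_blocks,)
    selected = np.sort(np.argsort(block_scores)[-top_k:])

    # Attend only to tokens inside the selected blocks.
    k_sel = k_blocks[selected].reshape(-1, d)
    v_sel = v_blocks[selected].reshape(-1, d)
    return dense_attention(q, k_sel, v_sel), selected


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))   # 4 query heads in one group
k = rng.standard_normal((64, 16))  # 64 cached tokens
v = rng.standard_normal((64, 16))
out, blocks = block_sparse_gqa_decode(q, k, v, block_size=8, top_k=2)
print(out.shape, blocks.shape)
```

In this sketch each query attends to top_k * block_size tokens instead of seq_len, so the attention cost per decode step scales with the number of selected blocks rather than with the full context. Setting top_k to the total number of blocks recovers dense attention exactly, which is a convenient sanity check.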