Section 01
[Introduction] MISA: An Efficient Sparse Attention Optimization Scheme for Long-Context LLM Inference
MISA is a Mixture-of-Experts mechanism for indexer sparse attention in long-context LLM inference. Its core innovation is to treat the index heads of DeepSeek Sparse Attention as an expert pool and use a lightweight router to dynamically select a small number of active heads (only 8 in experiments) for token-level scoring. Without any additional training, it matches the performance of the original 64-head indexer while achieving a 3.82x kernel speedup.
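The routing idea can be sketched as follows. This is a minimal illustrative example, not MISA's actual implementation: the function name, router shape, and top-k gating scheme are assumptions, showing only how a lightweight router could pick a few active index heads out of a larger pool.

```python
import numpy as np

def route_index_heads(hidden, router_w, num_active=8):
    """Hypothetical sketch: score every index head with a linear
    router, keep the top-k, and softmax-normalize their gates."""
    # hidden:   (d_model,) pooled representation of the current token
    # router_w: (num_heads, d_model) router weights, one row per index head
    logits = router_w @ hidden                 # one score per head
    active = np.argsort(logits)[-num_active:]  # indices of top-k heads
    gates = np.exp(logits[active] - logits[active].max())
    gates /= gates.sum()                       # softmax over active heads only
    return active, gates

rng = np.random.default_rng(0)
d_model, num_heads = 64, 64  # 64 index heads, as in DeepSeek Sparse Attention
active, gates = route_index_heads(rng.normal(size=d_model),
                                  rng.normal(size=(num_heads, d_model)))
print(len(active), gates.sum())
```

Only the 8 selected heads would then run their token-level scoring kernels, which is where the reported kernel speedup over the full 64-head indexer comes from.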