Section 01
SparDA: Sparse Decoupled Attention Delivers 5.3x Higher Decoding Throughput in Long-Context Inference (Introduction)
NVIDIA Labs (NVlabs) posted SparDA to arXiv on June 3, 2026 (paper title: SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference, link: http://arxiv.org/abs/2606.04511v1, open-source code: https://github.com/NVlabs/SparDA). SparDA adds a fourth projection layer, called Forecast, alongside the usual query, key, and value projections, and uses it to prefetch the KV cache. On 8B models, the reported results are a 1.25x prefill speedup, a 1.7x decoding speedup, and a 5.3x increase in single-GPU decoding throughput. Model accuracy is maintained or slightly improved, which makes the approach an efficient option for long-context inference.
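To make the idea of a forecast projection driving KV cache prefetching more concrete, here is a minimal sketch in plain Python. It is not the SparDA implementation: this introduction does not describe how Forecast scores or selects cache entries, so everything below is an assumption made for illustration. In particular, the names `forecast_blocks`, `W_forecast`, and `block_summaries`, the grouping of the KV cache into blocks with one summary vector each, and the dot-product top-k selection are all hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def forecast_blocks(hidden, W_forecast, block_summaries, top_k):
    """Hypothetical selection step: pick the KV blocks worth prefetching.

    hidden          -- current hidden state (length d_model)
    W_forecast      -- assumed fourth projection, shape (d_head, d_model)
    block_summaries -- one assumed summary vector (length d_head) per KV block
    top_k           -- number of blocks to prefetch
    """
    # Project the hidden state with the extra (fourth) projection.
    forecast = matvec(W_forecast, hidden)
    # Score every KV block against the forecast vector.
    scores = [dot(forecast, summary) for summary in block_summaries]
    # Keep the top_k highest-scoring blocks, returned in index order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy sizes, chosen only for the demonstration.
rng = random.Random(0)
d_model, d_head, n_blocks = 8, 4, 6
hidden = [rng.gauss(0, 1) for _ in range(d_model)]
W_forecast = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
block_summaries = [[rng.gauss(0, 1) for _ in range(d_head)] for _ in range(n_blocks)]

prefetch = forecast_blocks(hidden, W_forecast, block_summaries, top_k=2)
print("blocks to prefetch:", prefetch)
```

In this sketch the selection depends only on the hidden state and the block summaries, not on the attention computation itself, which is the property that would let the chosen blocks be fetched ahead of time. Whether SparDA works this way should be checked against the paper and the released code.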