Zing Forum

DashAttention: A Differentiable and Adaptive Sparse Hierarchical Attention Mechanism

This article introduces DashAttention, an efficient attention mechanism that uses the α-entmax transformation for adaptive sparse block selection. At 75% sparsity it maintains accuracy comparable to full attention, and its inference speed surpasses FlashAttention-3.

Tags: attention mechanism, long context, sparse attention, FlashAttention, LLM optimization, α-entmax
Published 2026-05-19 01:59 | Recent activity 2026-05-19 11:27 | Estimated read 5 min

Section 01

DashAttention: A Differentiable and Adaptive Sparse Hierarchical Attention Mechanism

DashAttention is a sparse hierarchical attention mechanism proposed in May 2026. It targets the main bottleneck of long-context modeling in large language models (LLMs): the computation and memory cost of full attention, which grows quadratically with sequence length. Its core idea is to use the α-entmax transformation for adaptive sparse block selection. At 75% sparsity it maintains accuracy comparable to full attention, and its inference speed surpasses FlashAttention-3.

Section 02

Background: Current Status and Limitations of Hierarchical Attention

Current hierarchical attention methods (such as NSA and InfLLMv2) adopt a two-stage strategy: a coarse-grained stage selects the top-k KV blocks, then a fine-grained stage applies softmax attention to the tokens in the selected blocks. This design has two limitations: 1. A fixed k assumes every query needs the same number of blocks, so it cannot adapt to queries with different information needs; 2. The top-k operation is discrete and discontinuous, which blocks gradient flow through the selection step and prevents end-to-end optimization.
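
Both limitations can be seen in a few lines of plain Python. This is only an illustrative sketch: `topk_blocks`, the block scores, and the value of k are hypothetical and are not taken from NSA, InfLLMv2, or the DashAttention paper.

```python
def topk_blocks(scores, k):
    """Hard selection: return the indices of the k highest-scoring KV blocks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

# Hypothetical block-relevance scores for two queries over 8 KV blocks.
peaked_query = [9.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.0]   # one block matters
diffuse_query = [2.0, 1.9, 2.1, 1.8, 2.0, 1.9, 2.1, 2.0]  # many blocks matter

k = 4  # assumed budget, fixed for every query
peaked_sel = topk_blocks(peaked_query, k)
diffuse_sel = topk_blocks(diffuse_query, k)

# Limitation 1: both queries receive exactly k blocks, whatever they need.
print(len(peaked_sel), len(diffuse_sel))

# Limitation 2: the selected set is piecewise constant in the scores.
# A small change leaves it untouched (no gradient signal) ...
before = topk_blocks([3.0, 2.0, 1.0, 0.0], 2)
nudged = topk_blocks([3.0, 2.0, 1.5, 0.0], 2)
# ... until a score crosses the k-th largest value and the set jumps.
flipped = topk_blocks([3.0, 2.0, 2.01, 0.0], 2)
print(before, nudged, flipped)
```

The peaked query wastes three of its four slots on near-irrelevant blocks, while the diffuse query is forced to drop half of its equally relevant blocks.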

Section 03

Core Innovations: Adaptive Sparsity and Differentiable Design

DashAttention has two major innovations: 1. α-entmax adaptive sparse selection: the number of KV blocks selected varies with each query's needs, avoiding the one-size-fits-all problem of top-k; 2. Fully differentiable hierarchical architecture: both sparse selection and attention computation keep continuous gradients, supporting end-to-end optimization. In addition, its non-dispersive property prevents attention from being scattered across irrelevant tokens.
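
To make the adaptive behaviour concrete, here is a minimal sketch of α-entmax computed by bisection on its threshold, in plain Python. It is a generic textbook-style solver, not the paper's Triton kernel. The function name `entmax`, the choice α = 1.5, and the block scores are all assumptions made for illustration; the article does not say which α DashAttention uses.

```python
def entmax(scores, alpha=1.5, iters=60):
    """alpha-entmax via bisection on the threshold tau (requires alpha > 1).

    Each output is p_i = max((alpha - 1) * s_i - tau, 0) ** (1 / (alpha - 1)),
    with tau chosen so that the outputs sum to 1.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch only handles alpha > 1")
    n = len(scores)
    z = [(alpha - 1.0) * s for s in scores]
    expo = 1.0 / (alpha - 1.0)

    def probs(tau):
        return [max(v - tau, 0.0) ** expo for v in z]

    # At lo the largest entry alone already gives mass 1 (sum >= 1);
    # at hi every entry is at most 1/n (sum <= 1).
    lo = max(z) - 1.0
    hi = max(z) - (1.0 / n) ** (alpha - 1.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = probs(lo)
    total = sum(p)
    return [v / total for v in p]  # remove the residual bisection error

# Hypothetical block scores for two queries over 8 KV blocks.
peaked_query = [9.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.0]
diffuse_query = [2.0, 1.9, 2.1, 1.8, 2.0, 1.9, 2.1, 2.0]

peaked_p = entmax(peaked_query)
diffuse_p = entmax(diffuse_query)

# Blocks with exactly zero weight can be skipped by the fine-grained stage.
peaked_support = sum(1 for v in peaked_p if v > 0.0)
diffuse_support = sum(1 for v in diffuse_p if v > 0.0)
print(peaked_support, diffuse_support)
```

With these scores the peaked query keeps a single block and the diffuse query keeps all eight, with no k to tune. Because the weights vary continuously with the scores, gradients can flow through the selection step.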

Section 04

Experimental Evidence: Excellent Performance in Accuracy and Efficiency

Experimental results show: 1. Accuracy: at 75% sparsity, DashAttention is comparable to full attention, and its accuracy-efficiency Pareto frontier is better than those of NSA and InfLLMv2; 2. Inference speed: the GPU implementation written in Triton surpasses FlashAttention-3; 3. Long-context capability: the non-dispersive property is especially helpful in precise retrieval and reasoning tasks.

Section 05

Technical Implementation Details

1. α-entmax transformation: a generalization of softmax. α = 1 recovers softmax, and values of α above 1 (up to 2 here) produce a sparse distribution in which low-scoring entries receive exactly zero weight; 2. Two-stage process: coarse-grained block selection using α-entmax, followed by fine-grained softmax with prior weights; 3. Triton implementation: custom GPU kernels are tuned to the memory hierarchy and compute characteristics of the hardware, converting the theoretical advantages into practical speedups.
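
For reference, the standard definition of α-entmax from the sparse-attention literature can be written as below. The notation is introduced here rather than taken from the article: z is the vector of n scores (block scores in this setting), p is the output distribution on the probability simplex, and τ is a threshold.

```latex
\[
\alpha\text{-entmax}(\mathbf{z})
  = \operatorname*{arg\,max}_{\mathbf{p} \in \Delta^{n-1}}
    \; \mathbf{p}^{\top}\mathbf{z} + H_{\alpha}(\mathbf{p}),
\qquad
H_{\alpha}(\mathbf{p}) =
\begin{cases}
  \dfrac{1}{\alpha(\alpha-1)} \sum_{j} \bigl(p_j - p_j^{\alpha}\bigr), & \alpha \neq 1,\\[6pt]
  -\sum_{j} p_j \log p_j, & \alpha = 1.
\end{cases}
\]
\[
p_i = \bigl[(\alpha-1)\,z_i - \tau(\mathbf{z})\bigr]_{+}^{1/(\alpha-1)},
\qquad \sum_{i} p_i = 1 \quad (\alpha > 1).
\]
```

The second line is the closed form of the solution for α > 1: every score with (α-1) z_i at or below the threshold τ gets exactly zero probability, which is where the sparsity comes from. Setting α = 1 gives softmax, which is always dense, and α = 2 gives sparsemax.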

Section 06

Application Scenario Outlook

DashAttention is suited to scenarios such as long document understanding (legal documents, technical manuals), code repository analysis (cross-file understanding), dialogue systems (maintaining ultra-long history), and multimodal long sequences (processing large numbers of visual tokens).

Section 07

Conclusion: An Efficient Solution for Long-Context Modeling

DashAttention balances accuracy and efficiency through adaptive sparsity and differentiable design, making it a highly competitive sparse attention method. As demand for long context in LLMs grows, such mechanisms are likely to play an important role in future model architectures. Paper link: http://arxiv.org/abs/2605.18753v1, published on May 18, 2026.