Zing Forum

DASH-KV: Asymmetric Hashing Enables Linear-Complexity Inference for Long-Context LLMs

DASH-KV reframes the attention mechanism as an approximate nearest neighbor search using asymmetric deep hashing, reducing the complexity of long-context LLM inference from O(N²) to O(N) while maintaining the performance of full-precision attention.

Tags: Long-Context Inference · Attention Mechanism · KV Cache · Approximate Nearest Neighbor Search · Deep Hashing · LLM Optimization · Linear Complexity · LongBench
Published 2026-04-21 19:33 · Recent activity 2026-04-23 09:51 · Estimated read 6 min

Section 01

DASH-KV: Asymmetric Hashing Enables Linear-Complexity Inference for Long-Context LLMs (Introduction)

DASH-KV reframes attention computation as an approximate nearest neighbor search via asymmetric deep hashing, reducing the complexity of long-context LLM inference from O(N²) to linear O(N) while maintaining performance comparable to full-precision attention, thereby removing the bottleneck that the standard attention mechanism hits on long sequences.

Section 02

Dilemmas of Long-Context Inference and Limitations of Existing Solutions

The computational complexity of the standard LLM attention mechanism grows with the square of the sequence length, O(N²), so latency rises sharply when processing long documents, codebases, or multi-turn dialogues. Existing workarounds each have limitations: KV Cache compression only alleviates memory pressure; it sacrifices generation quality and does not reduce computational overhead. Sparse attention reduces the computational load but significantly degrades performance on tasks that require global dependency modeling.
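To make the O(N²) bottleneck concrete, here is a minimal NumPy sketch of naive full-precision attention; it is not DASH-KV code, just an illustration of why the score matrix dominates cost at long sequence lengths:

```python
# Illustrative only: full softmax attention materializes an N x N score
# matrix, so compute and memory both grow quadratically with sequence length.
import numpy as np

def full_attention(Q, K, V):
    """Naive full-precision attention; Q, K, V all have shape (N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (N, N) -- the O(N^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                     # (N, d)

N, d = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
out = full_attention(Q, K, V)
print(out.shape)  # (1024, 64); the intermediate score matrix was 1024 x 1024
```

Doubling N quadruples the size of `scores`, which is exactly the scaling DASH-KV's nearest-neighbor reformulation avoids.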

Section 03

Core Design of DASH-KV: Asymmetric Encoding and Dynamic Mixed Precision

The core idea of DASH-KV is to reframe attention computation as an approximate nearest neighbor search. Its key innovations include:

  1. Asymmetric Encoding: queries are mapped to compact hash codes (low precision, low overhead), while keys retain high-precision representations to preserve attention accuracy;
  2. Dynamic Mixed-Precision Mechanism: critical tokens are identified adaptively; important tokens take the full-precision path, ordinary tokens take the hash-accelerated path, and the two results are seamlessly fused.
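The two ideas above can be sketched together. This is a hedged toy version, not the paper's implementation: the learned hash network is replaced by plain sign() binarization, and "important" tokens are simply a caller-supplied mask (e.g. sink or recent tokens), scored exactly while the rest use the cheap asymmetric path:

```python
# Toy sketch of asymmetric scoring plus a mixed-precision token split.
# Assumptions (not from the paper): sign() stands in for the learned hash,
# and the importance mask is given externally.
import numpy as np

def asymmetric_scores(q, K):
    """Score full-precision keys K (N, d) against a binarized query q (d,)."""
    q_code = np.sign(q)            # compact low-precision query representation
    q_code[q_code == 0] = 1.0
    return K @ q_code              # keys stay full precision (asymmetric)

def mixed_precision_select(q, K, important, k=4):
    """Important tokens take the exact path; the rest take the hash path."""
    approx = asymmetric_scores(q, K)
    exact = K @ q                                # full-precision scores
    scores = np.where(important, exact, approx)  # fuse the two paths
    return np.argsort(scores)[::-1][:k]          # keys kept for attention

N, d = 16, 8
rng = np.random.default_rng(1)
q, K = rng.standard_normal(d), rng.standard_normal((N, d))
important = np.zeros(N, dtype=bool)
important[:2] = True                             # e.g. sink/recent tokens
print(mixed_precision_select(q, K, important))
```

In a real system the two score scales would need calibration before fusion; the sketch only shows the routing structure.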

Section 04

Technical Implementation Details of DASH-KV

Deep Hashing Network

A lightweight deep network is used to map queries to binary/low-bit hash codes, with features including: learnable hashing (optimized for attention), end-to-end training (jointly optimized with the main model), and hardware-friendliness (supports bit operations and SIMD acceleration).
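A minimal sketch of such a hashing head, under assumed details not given in the source: one hidden layer with tanh, thresholding to bits, and bit-packing so codes can later be compared with XOR and popcount. In training, the hard threshold would be relaxed (e.g. tanh plus a straight-through estimator) so the head can be optimized end to end; only the forward pass is shown here:

```python
# Hypothetical lightweight hashing head: queries (N, d) -> b-bit codes.
# Architecture (hidden layer size, bit width) is assumed for illustration.
import numpy as np

def hash_head(Q, W1, W2):
    """Map queries (N, d) to binary codes (N, b) in {0, 1}."""
    h = np.tanh(Q @ W1)                  # cheap learnable nonlinearity
    return (h @ W2 > 0).astype(np.uint8)

def pack_bits(codes):
    """Pack {0,1} codes into uint8 words so XOR + popcount work bytewise."""
    return np.packbits(codes, axis=-1)

rng = np.random.default_rng(2)
d, hidden, b, N = 64, 32, 16, 8
W1 = rng.standard_normal((d, hidden))
W2 = rng.standard_normal((hidden, b))
codes = hash_head(rng.standard_normal((N, d)), W1, W2)
packed = pack_bits(codes)
print(codes.shape, packed.shape)  # (8, 16) (8, 2)
```

The packed form is what makes the method hardware-friendly: distance between codes reduces to bit operations that map well onto SIMD units.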

Approximate Nearest Neighbor Search

A multi-stage strategy is adopted: coarse filtering (fast candidate key selection via hash codes) → fine ranking (detailed similarity calculation) → Top-K selection (selecting the most similar keys), converting full attention into local computation to achieve linear complexity.
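The three stages can be sketched end to end. This is an illustrative stand-in, not the paper's code: hash codes come from a fixed random hyperplane projection rather than the learned network, and the candidate budget is arbitrary:

```python
# Hedged sketch of the coarse-filter -> fine-rank -> Top-K pipeline.
# Assumption: random hyperplanes P replace the paper's learned hash network.
import numpy as np

def hamming(a, b):
    """Hamming distance between packed uint8 code arrays."""
    return np.unpackbits(a ^ b, axis=-1).sum(axis=-1)

def search(q, K, P, n_candidates=8, k=2):
    q_code = np.packbits(q @ P > 0, axis=-1)
    k_codes = np.packbits(K @ P > 0, axis=-1)
    # Stage 1: coarse filter -- cheapest comparison, bit operations only.
    cand = np.argsort(hamming(k_codes, q_code))[:n_candidates]
    # Stage 2: fine ranking -- exact similarity on the small candidate set.
    sims = K[cand] @ q
    # Stage 3: Top-K -- only these keys enter the attention computation.
    return cand[np.argsort(sims)[::-1][:k]]

rng = np.random.default_rng(3)
N, d, b = 64, 32, 16
K, q = rng.standard_normal((N, d)), rng.standard_normal(d)
P = rng.standard_normal((d, b))   # random hyperplanes (illustrative stand-in)
topk = search(q, K, P, n_candidates=8, k=2)
print(topk)
```

Because the expensive exact scoring in stage 2 touches only a constant-size candidate set per query, total work grows linearly in N instead of quadratically.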

Section 05

Experimental Evaluation: Win-Win in Performance and Efficiency

Evaluated on the LongBench benchmark (covering single/multi-document QA, summarization, few-shot learning, etc.):

  • Performance: On par with full-precision attention, outperforming existing baselines;
  • Complexity: Successfully reduced to O(N), with significant acceleration effects for long sequences;
  • Memory Efficiency: Hash codes greatly reduce KV Cache usage, supporting longer contexts.
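A back-of-envelope calculation shows where the memory saving comes from; the parameter values here (128-dim head, fp16 keys, 64-bit codes) are illustrative assumptions, not figures from the paper:

```python
# Illustrative memory arithmetic: full-precision key vs. a compact hash code.
head_dim, fp16_bits, code_bits = 128, 16, 64   # assumed parameters
full_key_bits = head_dim * fp16_bits           # 128 * 16 = 2048 bits per key
print(full_key_bits // code_bits)              # -> 32x smaller per hashed entry
```

Whatever fraction of the cache can live in hashed form shrinks by that factor, which is what allows longer contexts within a fixed memory budget.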

Section 06

Comparison with Related Work: Unique Advantages of DASH-KV

DASH-KV achieves linear complexity while maintaining the expressive power of full attention, with clear advantages over other methods:

| Method Type | Complexity | Main Limitation | DASH-KV Advantage |
| --- | --- | --- | --- |
| Full Attention | O(N²) | Infeasible for long sequences | Linear complexity |
| KV Compression | O(N²) | Only relieves memory pressure | Reduces computational overhead |
| Sparse Attention | O(N) | Structural constraints | No structural constraints; retains global modeling |
| Linear Attention | O(N) | Loss of expressive power | Matches full-precision performance |

Section 07

Application Value and Future Outlook

Application Scenarios

  • Long document processing (legal, academic, technical manuals);
  • Code understanding and generation (large codebases);
  • Multi-turn dialogues (longer history, improved coherence);
  • Retrieval-augmented generation (more retrieval results, better answer quality).

Limitations & Outlook

  • Hash quality depends on how well the hashing network is trained, and may require adaptation to out-of-distribution data;
  • Room remains for hardware optimization, such as deeper integration with GPU kernels;
  • Can be combined with KV quantization, model quantization, and other techniques for compounding gains.