Zing Forum

DFlare Breaks Block Diffusion Speculative Decoding Bottleneck: 5.52x Inference Speedup via Layer-Wise Fusion Mechanism

The Tencent AngelSlim team proposed DFlare, which expands draft model capacity through a layer-wise fusion mechanism. It achieves a 5.52x wall-clock speedup on Qwen3-4B, about 11% higher than DFlash.

DFlare · Speculative Decoding · Block Diffusion · Inference Acceleration · AngelSlim · Tencent · LLM Inference · Diffusion Models
Published 2026-06-01 19:18 · Recent activity 2026-06-02 11:25 · Estimated read 6 min

Section 01

DFlare: Breakthrough in Block Diffusion Speculative Decoding with 5.52x Speedup on Qwen3-4B

The Tencent AngelSlim team proposed DFlare, a block diffusion speculative decoding method that uses layer-wise fusion to scale draft model capacity. It reaches a 5.52x wall-clock speedup on Qwen3-4B, about 11% higher than DFlash's roughly 4.97x. The work targets the capacity bottleneck in DFlash and offers a new option for speeding up LLM inference.


Section 02

Background: Evolution of Speculative Decoding & DFlash's Bottlenecks

Traditional Speculative Decoding

  • Core idea: A small draft model proposes candidate tokens, and the large target model verifies them.
  • Challenges: The acceptance rate drops when the gap between the two models is large, and two independent models are required.
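
The draft-then-verify cycle described above can be sketched as follows. This is a minimal illustration with greedy (exact-match) acceptance, not the procedure of any particular system. The names `speculative_step`, `toy_draft`, and `toy_target` are hypothetical, and the toy models are plain functions rather than neural networks.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """One draft-then-verify cycle with greedy (exact-match) acceptance."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2) The target checks each position. A real system scores all k positions
    #    in one batched forward pass; this loop only emulates that.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)

    # 3) Every draft token matched, so the target's pass yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft: agrees with the target except when the target says 5."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


print(speculative_step([1], toy_draft, toy_target, k=4))
print(speculative_step([5], toy_draft, toy_target, k=3))
```

Each cycle emits between 1 and k + 1 tokens for roughly one target pass, which is why a low acceptance rate erodes the speedup.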

Block Diffusion Speculative Decoding (DFlash)

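  • The drafter is not an independent language model: its layers are conditioned on hidden representations taken from the target model, which also acts as the verifier.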
  • Draft phase: predict a whole block of tokens via diffusion. Verification phase: the target checks the block in parallel.
  • Bottleneck: every draft layer shares a single fused representation built from a few target layers, which limits expressiveness and makes it hard to scale draft capacity.

Section 03

DFlare's Core: Layer-Wise Fusion Mechanism

Key Innovation

  • Layer-wise fusion: each draft layer learns its own weighted combination of target layers, so it receives an input tailored to that layer.
  • Lightweight implementation: the fusion uses an attention mechanism, adds minimal extra cost, and is trainable end to end.
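
One way to write the contrast between shared and layer-wise fusion is given below. It is a sketch and not necessarily the authors' exact formulation. All symbols are assumptions introduced here: h_j is the hidden state of the j-th selected target layer (J layers in total), f is whatever fixed fusion function the shared design uses, c is the conditioning input, and s_j^(l) is a learned score that draft layer l assigns to target layer j.

```latex
\[
\begin{aligned}
\text{Shared fusion:}\quad
  & c = f(h_1, \dots, h_J)
  && \text{(the same } c \text{ feeds every draft layer)} \\[4pt]
\text{Layer-wise fusion:}\quad
  & c^{(\ell)} = \sum_{j=1}^{J} \alpha^{(\ell)}_{j}\, h_j,
  \qquad
  \alpha^{(\ell)}_{j} = \frac{\exp\!\big(s^{(\ell)}_{j}\big)}{\sum_{j'=1}^{J} \exp\!\big(s^{(\ell)}_{j'}\big)}
  && \text{(one mixture per draft layer } \ell\text{)}
\end{aligned}
\]
```

Under this reading, adding draft layers adds new mixtures rather than reusing one conditioning vector, which is what lets the deeper draft model use its extra capacity.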

Training Data Expansion

  • The training set grows from DFlash's 800K samples to 2.4M samples (3x), so the expanded draft capacity is fully utilized.

Section 04

Experimental Results: Significant Speedup Across Models & Tasks

Wall-Clock Acceleration

Model | DFlare speedup | DFlash baseline | Improvement
Qwen3-4B | 5.52x | ~4.97x | +11%
Qwen3-8B | 5.46x | ~5.06x | +8%
GPT-OSS-20B | 3.91x | ~3.72x | +5%
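
The improvement column can be reproduced from the two speedup columns. The short check below uses only the figures in the table. The DFlash baselines are marked approximate in the source, so the percentages are rounded to whole numbers. The helper name `relative_gain_pct` is introduced here for illustration.

```python
# (DFlare speedup, approximate DFlash speedup) as listed in the table above
results = {
    "Qwen3-4B": (5.52, 4.97),
    "Qwen3-8B": (5.46, 5.06),
    "GPT-OSS-20B": (3.91, 3.72),
}


def relative_gain_pct(dflare: float, dflash: float) -> float:
    """Relative improvement of DFlare over DFlash, in percent."""
    return (dflare / dflash - 1.0) * 100.0


gains = {model: round(relative_gain_pct(*pair)) for model, pair in results.items()}
for model, gain in gains.items():
    print(f"{model}: +{gain}%")
```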

Key Observations

  • The relative gain over DFlash shrinks as the target model grows: +11% on Qwen3-4B, +8% on Qwen3-8B, +5% on GPT-OSS-20B.
  • Speedups are consistent across math reasoning, code generation, and dialogue tasks.

Section 05

Technical Deep Dive: Diffusion Model & Inter-Layer Attention

Diffusion Model Role

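  • Parallel generation: all tokens in a block are predicted together rather than one at a time.
  • Iterative refinement: repeated denoising steps improve draft quality.
  • Conditional control: the draft is steered by conditioning signals from the target model.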
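
To make parallel generation and iterative refinement concrete, here is a toy sketch in the style of generic masked-diffusion decoding. It is not DFlare's actual drafting procedure, which the source does not spell out. The names `draft_block`, `toy_denoiser`, and `MASK` are hypothetical, and the denoiser is a stand-in function that returns a (token, confidence) pair for every position.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decided yet
Block = List[Optional[int]]
Denoiser = Callable[[Block], List[Tuple[int, float]]]  # assumed interface


def draft_block(denoiser: Denoiser, block_size: int, steps: int) -> Block:
    """Fill a block in `steps` refinement passes, most confident positions first."""
    block: Block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division: positions committed per pass
    for _ in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        preds = denoiser(block)  # one call predicts every position in parallel
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]  # commit the highest-confidence predictions
    return block


def toy_denoiser(block: Block) -> List[Tuple[int, float]]:
    """Hypothetical denoiser: token 100 + i, with earlier positions more confident."""
    return [(100 + i, 1.0 / (1 + i)) for i in range(len(block))]


print(draft_block(toy_denoiser, block_size=4, steps=2))
```

With `steps=1` the whole block is committed in a single pass. More steps trade extra draft compute for the chance to revise low-confidence positions using tokens that are already fixed.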

Inter-Layer Attention

  • Query: Draft layer representation.
  • Key/Value: Target model layer representations.
  • Output: Weighted fusion for customized input.
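
The query/key/value roles above map onto ordinary scaled dot-product attention. The sketch below is a minimal pure-Python illustration under simplifying assumptions: it handles a single vector per layer, it omits the learned query, key, and value projections a trained model would have, and the function name `fuse_target_layers` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_target_layers(query: Vector, target_layers: List[Vector]) -> Vector:
    """Attend from one draft-layer query over the target-layer states.

    Keys and values are both the raw target states here (no learned projections).
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, h) * scale for h in target_layers])
    dim = len(target_layers[0])
    # Weighted sum of target-layer states: the customized input for this draft layer.
    return [sum(w * h[d] for w, h in zip(weights, target_layers)) for d in range(dim)]


layers = [[1.0, 0.0], [0.0, 1.0]]             # two toy target-layer states
print(fuse_target_layers([0.0, 0.0], layers))  # neutral query: equal mix of both layers
print(fuse_target_layers([10.0, 0.0], layers)) # query aligned with the first layer
```

Because each draft layer supplies its own query, two draft layers can draw on different target layers even though they see the same keys and values.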

Training Strategy

  • Combines diffusion training, layer fusion learning, and multi-task adaptation.

Section 06

Comparison with Related Work & Open Source Impact

vs DFlash

Feature | DFlash | DFlare
Conditional representation | Single shared fusion | Layer-wise, differentiated per draft layer
Source of target layers | Few layers | Broad layer set
Draft capacity expansion | Limited | Supports deeper architectures
Training data | 800K samples | 2.4M samples

vs Traditional Speculative Decoding

  • No independently trained draft model: the drafter is tied to the target through its hidden states.
  • Block-level parallel generation.
  • End-to-end trainable.

Open Source

  • The DFlare code is released as open source.


Section 07

Application Scenarios & Future Directions

Application Scenarios

  1. High-throughput API services.
  2. Real-time interaction (chatbots, assistants).
  3. Edge deployment (resource-constrained devices).
  4. Cost-sensitive applications.

Deployment Challenges

  • Memory usage of diffusion models.
  • Batch processing strategy integration.
  • Hardware adaptation.

Future Directions

  • Scale to 100B+ models.
  • Extend to multi-modal models.
  • Dynamic adjustment of block size and diffusion steps.
  • Hardware co-design optimization.
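
The trade-off behind tuning block size and diffusion steps can be captured with a back-of-the-envelope cost model. This is an assumption-laden sketch, not a formula from the source. The symbols are introduced here: tau is the average number of tokens emitted per verification cycle, s is the number of draft (denoising) passes per cycle, and c is the cost of one draft pass relative to one target pass. Verifying a block is assumed to cost about one target forward pass.

```latex
\[
S \;\approx\; \frac{\tau}{1 + s\,c}
\]
```

A larger block or more refinement steps can raise tau, but every extra step adds c to the denominator, so the best setting depends on how quickly acceptance saturates for a given workload.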

Section 08

Conclusion: DFlare's Value for LLM Inference Optimization

DFlare removes DFlash's capacity bottleneck through layer-wise fusion, gaining speed at minimal extra cost. The 5.52x wall-clock speedup on Qwen3-4B brings:

  • Better user experience (faster response).
  • Lower operational cost (reduced compute resources).
  • Higher scalability (supports more concurrent requests).

This work shows how much architectural details matter. With the code open-sourced, it is expected to drive further improvements in LLM inference optimization.