
DFlare Breaks the Block Diffusion Speculative Decoding Bottleneck: Layer-Wise Fusion Delivers a 5.52x Inference Speedup

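Tencent's AngelSlim team proposes DFlare, which scales draft model capacity through a layer-wise fusion mechanism and achieves a 5.52x wall-clock speedup on Qwen3-4B, an 11% improvement over DFlash.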

Tags: DFlare, speculative decoding, block diffusion, inference acceleration, AngelSlim, Tencent, LLM inference, diffusion models

Published 2026/06/01 19:18 | Last activity 2026/06/02 11:25 | Estimated reading time: 6 minutes

Section 01

DFlare: Breakthrough in Block Diffusion Speculative Decoding with 5.52x Speedup on Qwen3-4B

Tencent's AngelSlim team proposes DFlare, a block diffusion speculative decoding method that uses layer-wise fusion to scale draft model capacity. It achieves a 5.52x wall-clock speedup on Qwen3-4B, 11% higher than DFlash. The work targets the capacity bottleneck in DFlash and offers a new approach to accelerating LLM inference.

Section 02

Background: Evolution of Speculative Decoding & DFlash's Bottlenecks

Traditional Speculative Decoding

Core idea: A small draft model generates candidate tokens, and the large target model then verifies them.
Challenges: The acceptance rate drops when the gap between the two models is large, and two independent models must be maintained.
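
The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy illustration, not code from DFlare or DFlash: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and a real system would score all drafted positions in one batched target pass rather than in a Python loop.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token predictor


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) Verify phase: compare each proposal with the target's own choice.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models (hypothetical): the target counts upward modulo 10; the draft
# agrees except that it makes a mistake right after a 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


partial_accept = speculative_step([0], toy_draft, toy_target, k=4)
full_accept = speculative_step([5], toy_draft, toy_target, k=2)
```

In the first call the draft diverges at its fourth token, so only three drafted tokens survive and the target supplies the fourth. This is the acceptance-rate problem in miniature: the further the draft drifts from the target, the fewer tokens each round yields.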

Block Diffusion Speculative Decoding (DFlash)

Uses a single model as both draft generator and target validator.
Draft phase: predicts a block of tokens via diffusion. Validation phase: verifies the whole block in parallel.
Bottleneck: All draft layers share a single fused representation drawn from a few target layers, which limits expressiveness and makes it hard to scale draft capacity.

Section 03

DFlare's Core: Layer-Wise Fusion Mechanism

Key Innovation

Layer-wise fusion: Each draft layer learns its own weighted combination of target layers, so it receives an input customized to that layer.
Lightweight implementation: The fusion uses an attention mechanism that adds minimal extra cost and is trainable end to end.
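
To make the contrast with a single shared representation concrete, here is a simplified sketch. It assumes static softmax mixing weights per draft layer instead of the attention mechanism the article mentions, and the logit values are hypothetical stand-ins for learned parameters.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(target_states: List[Vector], logits: List[float]) -> Vector:
    """Weighted sum of target-layer hidden states for one draft layer."""
    weights = softmax(logits)
    dim = len(target_states[0])
    return [sum(w * h[d] for w, h in zip(weights, target_states))
            for d in range(dim)]


# Hidden states taken from three target layers (toy 2-d vectors).
target_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Shared fusion: one set of mixing logits reused by every draft layer,
# so both draft layers receive the identical input.
shared_logits = [0.0, 0.0, 0.0]
shared_inputs = [fuse(target_states, shared_logits) for _ in range(2)]

# Layer-wise fusion: each draft layer owns its mixing logits
# (hypothetical values), so each receives a different input.
per_layer_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
layerwise_inputs = [fuse(target_states, lg) for lg in per_layer_logits]
```

With shared weights, adding draft layers does not add new views of the target model. With per-layer weights, each additional draft layer can attend to a different mix of target features.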

Training Data Expansion

The training set grows from DFlash's 800K samples to 2.4M samples so that the expanded draft capacity is fully utilized.

Section 04

Experimental Results: Significant Speedup Across Models & Tasks

Wall-Clock Acceleration

Model	DFlare speedup	DFlash baseline	Improvement
Qwen3-4B	5.52x	~4.97x	+11%
Qwen3-8B	5.46x	~5.06x	+8%
GPT-OSS-20B	3.91x	~3.72x	+5%
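
The improvement column follows from the ratio of the two speedups. The short check below recomputes it from the table's figures. The DFlash baselines are the approximate values shown in the table, so the results are rounded to whole percentages.

```python
# (DFlare speedup, approximate DFlash baseline) taken from the table above.
results = {
    "Qwen3-4B": (5.52, 4.97),
    "Qwen3-8B": (5.46, 5.06),
    "GPT-OSS-20B": (3.91, 3.72),
}


def relative_gain(dflare: float, dflash: float) -> float:
    """Percentage improvement of DFlare's speedup over DFlash's."""
    return (dflare / dflash - 1.0) * 100.0


gains = {model: round(relative_gain(a, b)) for model, (a, b) in results.items()}
```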

Key Observations

Smaller models gain more (11% for 4B vs 5% for 20B).
Consistent performance across math reasoning, code generation, and dialogue tasks.

Section 05

Technical Deep Dive: Diffusion Model & Inter-Layer Attention

Diffusion Model Role

Parallel token generation for blocks.
Iterative refinement to improve quality.
Flexible conditional control.
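
The sketch below shows how parallel generation and iterative refinement can fit together. It follows one common masked-diffusion decoding pattern: start from a fully masked block, predict every position in one pass, and commit only the most confident predictions each step. The article does not specify DFlare's actual schedule, and `refine_block` and `toy_denoise` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decided yet
Block = List[Optional[int]]
# A denoiser maps the current block to a (token, confidence) pair per position.
Denoiser = Callable[[Block], List[Tuple[int, float]]]


def refine_block(block_size: int, steps: int, denoise: Denoiser) -> Block:
    """Fill a masked block in `steps` parallel passes."""
    block: Block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    for _ in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        preds = denoise(block)  # one parallel pass over every position
        # Commit the most confident masked positions; the rest stay masked
        # and are re-predicted with more context in the next pass.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
    return block


calls = 0


def toy_denoise(block: Block) -> List[Tuple[int, float]]:
    """Toy denoiser: token 10+i at position i, earlier positions more confident."""
    global calls
    calls += 1
    return [(10 + i, 1.0 / (1 + i)) for i in range(len(block))]


drafted = refine_block(block_size=4, steps=2, denoise=toy_denoise)
```

Here a four-token block is drafted with two denoiser passes instead of four sequential next-token calls, which is where the parallelism comes from.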

Inter-Layer Attention

Query: Draft layer representation.
Key/Value: Target model layer representations.
Output: Weighted fusion for customized input.
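
The query/key/value roles listed above map onto standard scaled dot-product attention taken across layers rather than across tokens. The sketch below is a minimal single-position illustration that omits the learned projection matrices a trained model would apply. `inter_layer_attention` is a hypothetical name, not a function from the AngelSlim repository.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def inter_layer_attention(query: Vector, keys: List[Vector],
                          values: List[Vector]) -> Vector:
    """Fuse target-layer states for one draft layer at one token position.

    query:  the draft layer's representation
    keys:   one vector per target layer
    values: one vector per target layer
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    # Softmax over the target-layer axis (max subtracted for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Two target layers with orthogonal toy states, used as both keys and values.
target_layers = [[1.0, 0.0], [0.0, 1.0]]

# Two draft layers with different queries pull different mixtures.
fused_a = inter_layer_attention([4.0, 0.0], target_layers, target_layers)
fused_b = inter_layer_attention([0.0, 4.0], target_layers, target_layers)
```

Because the query comes from the draft layer itself, the mixing weights depend on that layer's state, so the weighting can vary by layer and by input instead of being fixed.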

Training Strategy

Combines diffusion training, layer fusion learning, and multi-task adaptation.

Section 06

Comparison with Related Work & Open Source Impact

vs DFlash

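Feature	DFlash	DFlare
Conditioning representation	Single fused representation	Differentiated per layer
Target layer source	A few layers	A broad set of layers
Draft capacity scaling	Limited	Supports deeper architectures
Training data	800K	2.4M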

vs Traditional Speculative Decoding

Single model (no separate draft model).
Block-level parallel generation.
End-to-end trainable.

Open Source

Code repo: https://github.com/Tencent/AngelSlim
Part of Tencent's AngelSlim project, which focuses on LLM inference optimization.

Section 07

Application Scenarios & Future Directions

Application Scenarios

High-throughput API services.
Real-time interaction (chatbots, assistants).
Edge deployment (resource-constrained devices).
Cost-sensitive applications.

Deployment Challenges

Memory footprint of diffusion models.
Batch processing strategy integration.
Hardware adaptation.

Future Directions

Scale to 100B+ models.
Extend to multi-modal models.
Dynamic adjustment of block size and diffusion steps.
Hardware co-design optimization.

Section 08

Conclusion: DFlare's Value for LLM Inference Optimization

DFlare breaks DFlash's capacity bottleneck via layer-wise fusion, achieving significant speedup with minimal overhead. The 5.52x acceleration brings:

Better user experience (faster response).
Lower operational cost (reduced compute resources).
Higher scalability (supports more concurrent requests).

This work highlights how much architectural design details matter. With the code open-sourced, it is expected to drive further improvements in LLM inference optimization.