
DFlare Breaks Block Diffusion Speculative Decoding Bottleneck: 5.52x Inference Speedup via Layer-Wise Fusion Mechanism

The Tencent AngelSlim team proposed DFlare, which expands draft model capacity through a layer-wise fusion mechanism and achieves a 5.52x wall-clock speedup on Qwen3-4B, an 11% improvement over DFlash.

Tags: DFlare, speculative decoding, block diffusion, inference acceleration, AngelSlim, Tencent, LLM inference, diffusion models

Published 2026-06-01 19:18 | Recent activity 2026-06-02 11:25 | Estimated read 6 min

Section 01

DFlare: Breakthrough in Block Diffusion Speculative Decoding with 5.52x Speedup on Qwen3-4B

The Tencent AngelSlim team proposed DFlare, a block diffusion speculative decoding method that uses layer-wise fusion to scale draft model capacity. It reports a 5.52x wall-clock speedup on Qwen3-4B, about 11% higher than DFlash. The work targets DFlash's capacity bottleneck and offers a new option for accelerating LLM inference.

Section 02

Background: Evolution of Speculative Decoding & DFlash's Bottlenecks

Traditional Speculative Decoding

Core idea: A small draft model generates candidate tokens, and the large target model then verifies them.
Challenges: The acceptance rate drops when the gap between the two models is large, and two independent models must be maintained.
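
The draft-then-verify loop described above can be sketched in a few lines. This is a minimal greedy illustration, not code from DFlare or DFlash: the `NextToken` interface, the toy models, and the function name `speculative_step` are all hypothetical, and a real system would verify every drafted position in a single batched target pass instead of one call per token.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Draft k tokens, keep the longest prefix the target agrees with,
    then append one token chosen by the target itself."""
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verify phase: compare each proposal with the target's own choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft was accepted, so the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


# Toy models (assumptions for illustration only): the target counts upward,
# and the draft agrees except right after a multiple of 4.
def toy_target(seq: List[int]) -> int:
    return seq[-1] + 1


def toy_draft(seq: List[int]) -> int:
    return seq[-1] + 1 if seq[-1] % 4 != 0 else seq[-1] + 2


step_result = speculative_step([1], toy_draft, toy_target, k=5)
print(step_result)  # [2, 3, 4, 5]
```

Here three drafted tokens are accepted and the fourth is replaced by the target's choice, so one verification round yields four tokens. The larger the gap between the two models, the earlier the first mismatch tends to occur, which is the acceptance-rate problem noted above.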

Block Diffusion Speculative Decoding (DFlash)

Uses a single model as both draft generator and target validator.
Draft phase: predicts a block of tokens via diffusion. Validation phase: verifies the whole block in parallel.
Bottleneck: All draft layers share a single fused representation built from a few target layers, which limits expressiveness and makes it hard to scale draft capacity.

Section 03

DFlare's Core: Layer-Wise Fusion Mechanism

Key Innovation

Layer-wise fusion: Each draft layer learns its own weighted combination of target layers, so it receives an input tailored to that layer.
Lightweight implementation: Uses an attention mechanism with minimal extra cost and is end-to-end trainable.
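
To make the contrast with a shared fusion concrete, the sketch below mixes the same toy target-layer states in two ways: once with a single set of mixing logits reused by every draft layer, and once with separate logits per draft layer. The vectors, the logits, and the helper names `softmax` and `fuse` are illustrative assumptions. In a trained model the logits would be learned, and the article does not specify the exact parameterization DFlare uses.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(target_states: List[Vector], logits: List[float]) -> Vector:
    """Softmax-weighted sum of target-layer hidden states."""
    weights = softmax(logits)
    dim = len(target_states[0])
    return [sum(w * h[d] for w, h in zip(weights, target_states))
            for d in range(dim)]


# Three target layers with hidden size 2 (toy values).
target_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Shared fusion: one set of mixing logits reused by both draft layers.
shared_logits = [0.0, 0.0, 0.0]
shared_inputs = [fuse(target_states, shared_logits) for _ in range(2)]

# Layer-wise fusion: each draft layer owns its own mixing logits.
per_layer_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
layerwise_inputs = [fuse(target_states, lg) for lg in per_layer_logits]

print(shared_inputs)     # both draft layers see the same vector
print(layerwise_inputs)  # each draft layer sees a different mixture
```

With shared logits, adding more draft layers gives them no new view of the target model. With per-layer logits, each added layer can attend to a different part of the target's depth, which is the capacity argument made above.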

Training Data Expansion

The training set grows from DFlash's 800K samples to 2.4M samples so that the expanded draft capacity is fully used.

Section 04

Experimental Results: Significant Speedup Across Models & Tasks

Wall-Clock Acceleration

Model	DFlare speedup	DFlash baseline	Improvement
Qwen3-4B	5.52x	~4.97x	+11%
Qwen3-8B	5.46x	~5.06x	+8%
GPT-OSS-20B	3.91x	~3.72x	+5%
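
The percentage column in the table above follows from the two speedup columns as a relative gain. The short check below reproduces it, using the table's approximate DFlash baselines as given; the helper name `relative_gain_percent` is just for illustration.

```python
# (DFlare speedup, approximate DFlash baseline) copied from the table above.
rows = {
    "Qwen3-4B": (5.52, 4.97),
    "Qwen3-8B": (5.46, 5.06),
    "GPT-OSS-20B": (3.91, 3.72),
}


def relative_gain_percent(dflare: float, dflash: float) -> float:
    """Relative improvement of DFlare over DFlash, in percent."""
    return (dflare / dflash - 1.0) * 100.0


gains = {model: round(relative_gain_percent(*pair))
         for model, pair in rows.items()}
print(gains)  # {'Qwen3-4B': 11, 'Qwen3-8B': 8, 'GPT-OSS-20B': 5}
```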

Key Observations

Smaller models gain more (11% for 4B vs 5% for 20B).
Gains are consistent across math reasoning, code generation, and dialogue tasks.

Section 05

Technical Deep Dive: Diffusion Model & Inter-Layer Attention

Diffusion Model Role

Parallel token generation for blocks.
Iterative refinement to improve quality.
Flexible conditional control.
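
As a rough illustration of parallel generation with iterative refinement, the sketch below fills a block of masked positions over a fixed number of steps, committing the most confident guesses first. It is a generic confidence-based unmasking loop written under assumptions, not the sampler used by DFlash or DFlare: the `Scorer` interface, the toy scorer, and the function name `refine_block` are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been decided yet

# Hypothetical scorer: given the partially filled block, return a
# (token, confidence) guess for every position in one parallel call.
Scorer = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def refine_block(block_size: int, scorer: Scorer, steps: int) -> List[int]:
    """Fill a block in `steps` rounds, most confident positions first."""
    block: List[Optional[int]] = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    for _ in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        guesses = scorer(block)  # one parallel prediction for all positions
        # Commit the most confident masked positions; leave the rest masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = guesses[i][0]
    # per_step * steps >= block_size, so every position is filled by now.
    return [tok for tok in block if tok is not MASK]


# Toy scorer (assumption): position i predicts token 10 + i and is more
# confident about earlier positions.
def toy_scorer(block: List[Optional[int]]) -> List[Tuple[int, float]]:
    return [(10 + i, 1.0 / (1 + i)) for i in range(len(block))]


refined = refine_block(block_size=4, scorer=toy_scorer, steps=2)
print(refined)  # [10, 11, 12, 13]
```

Setting `steps=1` produces the whole block in a single parallel call, while more steps trade latency for the chance to condition later guesses on earlier commitments.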

Inter-Layer Attention

Query: Draft layer representation.
Key/Value: Target model layer representations.
Output: Weighted fusion for customized input.
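
The query, key, and value roles listed above map onto standard scaled dot-product attention taken over the layer axis. The sketch below is a single-head toy version under assumptions: it omits the learned query, key, and value projections a real implementation would have, and the vectors and the function name `inter_layer_attention` are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def inter_layer_attention(query: Vector, keys: List[Vector],
                          values: List[Vector]) -> Tuple[List[float], Vector]:
    """Attend from one draft-layer query over per-target-layer keys/values.

    Returns the attention weights (one per target layer) and the fused vector.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, fused


# Toy target-layer states used as both keys and values (projections omitted).
layer_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
draft_query = [4.0, 0.0]  # a draft layer that "looks for" the first feature

weights, fused = inter_layer_attention(draft_query, layer_states, layer_states)
print(weights)  # largest weight on the first target layer
print(fused)
```

Because the weights depend on the query, two draft layers with different representations receive different fused inputs from the same set of target layers.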

Training Strategy

Combines diffusion training, layer fusion learning, and multi-task adaptation.

Section 06

Comparison with Related Work & Open Source Impact

vs DFlash

Feature	DFlash	DFlare
Conditional Representation	Single Fusion	Layer-Wise Differentiation
Source of Target Layers	Few Layers	Broad Layer Set
Draft Capacity Expansion	Limited	Supports Deeper Architectures
Training Data	800K	2.4M

vs Traditional Speculative Decoding

Single model (no separate draft model).
Block-level parallel generation.
End-to-end trainable.

Open Source

Code repo: https://github.com/Tencent/AngelSlim
Part of the Tencent AngelSlim project, which focuses on LLM inference optimization.

Section 07

Application Scenarios & Future Directions

Application Scenarios

High-throughput API services.
Real-time interaction (chatbots, assistants).
Edge deployment (resource-constrained devices).
Cost-sensitive applications.

Deployment Challenges

Memory usage of diffusion models.
Batch processing strategy integration.
Hardware adaptation.

Future Directions

Scale to 100B+ models.
Extend to multi-modal models.
Dynamic adjustment of block size and diffusion steps.
Hardware co-design optimization.

Section 08

Conclusion: DFlare's Value for LLM Inference Optimization

DFlare removes DFlash's capacity bottleneck through layer-wise fusion and reaches up to 5.52x wall-clock speedup (on Qwen3-4B) with minimal overhead. This acceleration brings:

Better user experience (faster response).
Lower operational cost (reduced compute resources).
Higher scalability (supports more concurrent requests).

This work shows how much architectural details matter. Because the code is open source, it is expected to drive further improvements in LLM inference optimization.
