
TIDE: Cross-Architecture Distillation Enables High Performance for Small-Parameter Diffusion Language Models

Diffusion Language Models (dLLMs) excel at parallel decoding and bidirectional context modeling, but their performance remains tightly bound to parameter scale. The TIDE framework achieves cross-architecture knowledge distillation for the first time, compressing 8B/16B teacher models into a 0.6B student model and lifting the HumanEval code-generation score from 32.3 to 48.78.

Tags: Diffusion Language Models · Knowledge Distillation · Model Compression · Cross-Architecture Transfer · Code Generation · Parallel Decoding
Published 2026-04-30 01:59 · Recent activity 2026-04-30 10:32 · Estimated read 5 min

Section 01

Introduction: TIDE Framework Enables Cross-Architecture Distillation, Significantly Boosting Small-Parameter dLLM Performance

Diffusion language models (dLLMs) have advantages in parallel decoding and bidirectional context modeling, but their performance is tightly bound to parameter scale. The TIDE framework achieves cross-architecture knowledge distillation for the first time, compressing an 8B dense model and a 16B MoE model into a lightweight 0.6B student model. On the HumanEval code-generation benchmark, the student's score jumps from 32.3 to 48.78, breaking the scale bottleneck that has limited the practical application of dLLMs.


Section 02

Background: Advantages and Scale Dilemma of dLLMs, and Limitations of Existing Distillation Methods

dLLMs differ from traditional autoregressive models by virtue of parallel decoding and bidirectional context modeling, but they require billions of parameters to be competitive, which restricts deployment. Knowledge distillation is a mainstream model-compression method, yet existing approaches are confined to a single architecture and do not address knowledge transfer across architectures, where teacher and student differ in architecture, attention mechanism, and tokenizer.


Section 03

Analysis of Three Innovative Components of the TIDE Framework

TIDE solves cross-architecture distillation challenges through three components:

TIDAL: Dynamically Adjusting Distillation Intensity

Jointly models training progress and diffusion timesteps. In the early stages of training, it focuses on high-noise steps; in the later stages, it strengthens fine-grained steps to avoid efficiency loss.
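
The paper's exact schedule is not reproduced here; the sketch below only illustrates the core idea of a per-timestep distillation weight that depends jointly on training progress and noise level. The function name, the Gaussian bump, and the `sharpness` parameter are assumptions for illustration, not the paper's formulation.

```python
import math


def tidal_weight(timestep: float, progress: float, sharpness: float = 5.0) -> float:
    """Illustrative distillation weight for one diffusion timestep.

    timestep: noise level in [0, 1], where 1.0 means fully masked (high noise).
    progress: training progress in [0, 1].
    Returns a weight that favors high-noise steps early in training and
    shifts toward low-noise, fine-grained steps as training progresses.
    """
    # The target noise level drifts from 1.0 (high noise) down to 0.0 (fine-grained).
    target = 1.0 - progress
    # Peak the weight around the target noise level with a soft Gaussian bump.
    return math.exp(-sharpness * (timestep - target) ** 2)


if __name__ == "__main__":
    for progress in (0.0, 0.5, 1.0):
        weights = [round(tidal_weight(t / 4, progress), 3) for t in range(5)]
        print(f"progress={progress:.1f} -> weights over timesteps 0..1: {weights}")
```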

CompDemo: Complementary Masking for Context Enhancement

Splits the input into complementary parts and performs two forward passes to supplement context, improving the teacher model's prediction quality in high-masking scenarios.
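
A minimal sketch of this complementary-masking idea, assuming a PyTorch teacher exposed as a callable that takes token ids and a boolean mask; all names here are hypothetical, not the paper's API:

```python
import torch


def complementary_teacher_predictions(
    teacher,                  # callable: (ids, mask_positions) -> logits [B, L, V]
    input_ids: torch.Tensor,  # [B, L] token ids
    masked: torch.Tensor,     # [B, L] bool, True where the student sees [MASK]
) -> torch.Tensor:
    """Illustrative complementary-masking sketch (names are assumptions).

    The masked positions are split into two complementary halves. Each
    forward pass masks only one half, so the other half stays visible and
    supplies extra context. Predictions for each masked position are then
    taken from the pass in which that position was masked.
    """
    # Randomly assign each masked position to one of two complementary halves.
    half_a = masked & (torch.rand_like(input_ids, dtype=torch.float) < 0.5)
    half_b = masked & ~half_a

    logits_a = teacher(input_ids, half_a)   # half_b stays visible as context
    logits_b = teacher(input_ids, half_b)   # half_a stays visible as context

    # Merge: use logits from the pass where the position was actually masked.
    return torch.where(half_a.unsqueeze(-1), logits_a, logits_b)
```

Each masked position thus receives a teacher prediction produced while the complementary half of the sequence was visible, which is exactly the context that is missing when everything is masked at once.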

Reverse CALM: Cross-Tokenizer Alignment

Reverse-maps the student model's probability distribution to the teacher's token space, stabilizing gradient boundaries and filtering noise bidirectionally.
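
A hedged sketch of the reverse-mapping step, assuming a precomputed student-to-teacher token mapping matrix whose rows are probability distributions over teacher tokens; the construction of that matrix, the loss direction, and all names below are assumptions for illustration rather than the paper's method:

```python
import torch
import torch.nn.functional as F


def reverse_map_student_probs(
    student_logits: torch.Tensor,  # [B, L, V_student]
    mapping: torch.Tensor,         # [V_student, V_teacher], rows sum to 1
) -> torch.Tensor:
    """Project student probabilities into the teacher's token space.

    `mapping` is an assumed alignment matrix that redistributes each
    student token's probability mass over teacher tokens.
    """
    student_probs = student_logits.softmax(dim=-1)
    return student_probs @ mapping            # [B, L, V_teacher]


def cross_tokenizer_kl(student_logits, teacher_logits, mapping, eps: float = 1e-8):
    """KL(teacher || mapped student), computed in the teacher's vocabulary."""
    mapped = reverse_map_student_probs(student_logits, mapping).clamp_min(eps)
    teacher_probs = teacher_logits.softmax(dim=-1)
    return F.kl_div(mapped.log(), teacher_probs, reduction="batchmean")
```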


Section 04

Experimental Validation: TIDE's Performance Breakthroughs Across Multiple Tasks

The experiments build two heterogeneous distillation pipelines: 8B dense dLLM → 0.6B student and 16B MoE → 0.6B student. TIDE achieves an average improvement of 1.53 points across eight benchmarks, with the HumanEval code-generation score reaching 48.78, a relative improvement of over 50% compared with the autoregressive baseline of 32.3, demonstrating the advantages of dLLMs on tasks that benefit from bidirectional context and parallel decoding.
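
The relative-improvement figure can be verified directly from the two reported scores:

```python
baseline, tide = 32.3, 48.78
print(f"relative improvement: {(tide - baseline) / baseline:.1%}")  # ~51.0%
```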


Section 05

Conclusion: TIDE Opens a New Path for the Practical Application of dLLMs

TIDE proves that cross-architecture knowledge transfer for dLLMs is feasible. Its modular components (TIDAL, CompDemo, Reverse CALM) can be independently applied to scenarios such as progressive learning, semi-supervised learning, and cross-representation alignment. This marks the evolution of model compression from homogeneous to heterogeneous settings, which require alignment mechanisms designed around specific architectural characteristics.


Section 06

Future Outlook: Promotion of TIDE Components and Deployment Prospects of dLLMs

TIDE's components can be extended to other settings (e.g., progressive learning, semi-supervised learning), and heterogeneous distillation is likely to become an important topic in the multi-architecture era. The 0.6B-parameter model can run on a single GPU or edge device, and its HumanEval score of 48.78 is sufficient for programming-assistance scenarios, reducing deployment costs and helping dLLMs move from the laboratory into production.