Zing Forum

UniDDT: A Novel Decoupled Diffusion Transformer Architecture for Unified Multimodal Understanding and Generation

Nanjing University and ByteDance Seed Team jointly propose UniDDT, which achieves high-quality multimodal understanding and generation simultaneously in a unified visual space through a Noisy ViT encoder and a decoupled diffusion decoder, and delivers leading performance on benchmarks such as GenEval and MME.

Multimodal models, diffusion models, visual understanding, visual generation, Transformer, UniDDT, unified multimodal model, diffusion transformer
Published 2026-06-15 13:57 | Recent activity 2026-06-16 12:20 | Estimated read 6 min

Section 01

[Introduction] UniDDT: A Novel Decoupled Diffusion Transformer Architecture for Unified Multimodal Understanding and Generation

Nanjing University, ByteDance Seed Team, and the University of Hong Kong jointly propose the UniDDT architecture. It combines a Noisy ViT encoder, an LLM backbone, and a decoupled diffusion decoder to achieve high-quality multimodal understanding and generation in a unified visual space. The model reports leading results on GenEval (generation) and MME (understanding), and its code is open source (https://github.com/MCG-NJU/UniDDT).

Section 02

Research Background: Existing Unified Multimodal Models Face Three Core Challenges

Unified Multimodal Models (UMM) need to integrate visual understanding and generation capabilities, but existing solutions have the following problems:

  1. Modeling Conflict: Understanding focuses on high-level semantics, while generation requires fine-grained pixel details. Differences in objective functions and feature representations lead to conflicts in joint training;
  2. Fragmented Visual Space: Understanding uses a high-dimensional semantic space, while generation uses a VAE latent space; maintaining two spaces adds complexity and hinders scaling;
  3. Insufficient Data Utilization: The image-text duality is not fully utilized, and the same data is not used for both understanding and generation training.

Section 03

Core Architectural Innovations: Unified Semantic Extraction and Decoupled Generation Design

Three key innovations of UniDDT:

  • Noisy ViT Encoder: Processes noisy inputs and unifies semantic encoding for understanding (clean images) and generation (noisy latent variables);
  • LLM Backbone Network: Distinguishes tasks via prompt templates and enables bidirectional semantic interaction between text and vision;
  • Decoupled Diffusion Decoder: Optimized specifically for generation tasks to avoid interference with text decoding;

In addition, UniDDT adopts the VAE latent space as the unified visual representation to balance understanding and generation performance.
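
The split between clean inputs for understanding and noisy inputs for generation can be illustrated with a minimal sketch. This summary does not state UniDDT's noise schedule, so the linear interpolation used here is an assumption, and `make_encoder_input` with its parameters is a hypothetical name rather than anything from the released code.

```python
import random
from typing import List, Optional


def make_encoder_input(
    latent: List[float],
    task: str,
    t: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Build the visual input that a noise-aware encoder would receive.

    ASSUMPTION: noise is mixed in by linear interpolation,
    x_t = (1 - t) * x + t * eps with eps ~ N(0, 1). The source does not
    specify the schedule; this is for illustration only.
    """
    if task == "understanding":
        # Understanding sees the clean latent (noise level 0).
        return list(latent)
    if task != "generation":
        raise ValueError(f"unknown task: {task!r}")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = rng or random.Random()
    # Generation sees a latent corrupted to noise level t.
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]


latent = [0.5, -1.0, 2.0]  # stand-in for a flattened VAE latent
clean = make_encoder_input(latent, "understanding")
noisy = make_encoder_input(latent, "generation", t=0.7, rng=random.Random(0))
```

Because both tasks pass through the same function and the same latent space, a single encoder can serve them; only the noise level differs.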

Section 04

Training Strategy: Three-Stage Progressive Optimization Ensures Stability and Performance

A phased training approach is adopted to avoid model collapse:

  1. Warm-up Phase: Pretrain the Noisy ViT (on understanding data) and the diffusion decoder (on generation data) separately;
  2. Joint Training: Unfreeze all modules, use image-text dual data to construct understanding/generation samples, and promote mutual enhancement of tasks;
  3. Post-Training Phase: Fine-tune for specific tasks to improve benchmark performance.
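
The three stages above can be written down as a small schedule that records which modules are trained in each stage and on what data. This is a sketch under stated assumptions: the module names are hypothetical, and the choice to keep the LLM frozen during the first stage and to train every module in post-training is a guess, since the summary only says that all modules are unfrozen for joint training.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical module names for the three components described above.
MODULES: Tuple[str, ...] = ("noisy_vit", "llm", "diffusion_decoder")


@dataclass
class Stage:
    name: str
    # Maps each trainable module to the data it is trained on.
    trainable: Dict[str, str]


# ASSUMPTION: the LLM is frozen in stage 1 and all modules are trainable
# in stage 3; the source does not confirm either detail.
SCHEDULE: List[Stage] = [
    Stage("stage1_separate_pretraining", {
        "noisy_vit": "understanding data",
        "diffusion_decoder": "generation data",
    }),
    Stage("stage2_joint_training", {
        m: "paired image-text data (both tasks)" for m in MODULES
    }),
    Stage("stage3_post_training", {
        m: "task-specific data" for m in MODULES
    }),
]


def frozen_modules(stage: Stage) -> List[str]:
    """Return the modules that receive no gradient updates in this stage."""
    return [m for m in MODULES if m not in stage.trainable]


for stage in SCHEDULE:
    print(f"{stage.name}: train={sorted(stage.trainable)} "
          f"frozen={frozen_modules(stage)}")
```

In a real training loop, the frozen list would typically be used to disable gradients on those modules before building the optimizer.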

Section 05

Experimental Results: Leading Understanding and Generation Capabilities Validated on Multiple Benchmarks

Performance on authoritative benchmarks:

  • Generation Tasks: GenEval overall score 0.87, DPG overall score 86.9;
  • Understanding Tasks: MME perception score 1699.5, SEED-Bench overall score 76.5.

Conclusion: Neither task loses performance under joint training; instead, the two reinforce each other.

Section 06

Ablation Experiments: Validation of the Effectiveness of Key Design Choices

The ablation experiments show:

  1. Noisy ViT Warm-up: Starting joint training directly leads to collapse, while warming up the Noisy ViT first (the first training stage) significantly stabilizes optimization;
  2. Decoupled Design: Compared to full parameter sharing, the decoupled diffusion decoder improves generation quality while maintaining understanding performance;
  3. Dual Data Structure: Using image-text duality to construct data consistently improves performance.
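
The dual data idea can be sketched as turning one image-text pair into two training samples, one per task. The prompt strings and the `Sample` fields below are hypothetical placeholders: the summary only says that tasks are distinguished via prompt templates and does not give the templates themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    task: str         # "understanding" or "generation"
    prompt: str       # text fed to the LLM backbone
    image_role: str   # "input" (conditioning) or "target" (to be generated)
    target_text: str  # text supervision; empty for generation


def build_dual_samples(image_id: str, caption: str) -> List[Sample]:
    """Turn one image-text pair into one sample per task.

    ASSUMPTION: the prompt wording is invented for illustration and is not
    taken from the UniDDT code.
    """
    understanding = Sample(
        task="understanding",
        prompt=f"<image:{image_id}> Describe this image.",
        image_role="input",
        target_text=caption,
    )
    generation = Sample(
        task="generation",
        prompt=f"Generate an image: {caption}",
        image_role="target",
        target_text="",
    )
    return [understanding, generation]


samples = build_dual_samples("img_0001", "a red bicycle leaning on a wall")
for s in samples:
    print(s.task, "|", s.prompt)
```

The same pair thus supervises text output in one direction and image output in the other, which is the sense in which the data is used twice.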

Section 07

Technical Significance and Future Outlook: Pointing the Way for UMM Development

Technical Significance:

  • Challenges the assumption that a model must choose between understanding and generation: a single model achieves both at high quality;
  • Noisy ViT provides a new noise-robust idea for visual representation learning;
  • Decoupled unified design concept: Moderate task-specific optimization is better than full parameter sharing.

Outlook: The open-source implementation provides a strong baseline for the community and drives further development of UMM.