# UniDDT: A Novel Decoupled Diffusion Transformer Architecture for Unified Multimodal Understanding and Generation

> Nanjing University and ByteDance Seed Team jointly propose UniDDT, which achieves high-quality multimodal understanding and generation simultaneously in a unified visual space through a Noisy ViT encoder and a decoupled diffusion decoder, and delivers leading performance on benchmarks such as GenEval and MME.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-15T05:57:40.000Z
- Last activity: 2026-06-16T04:20:52.562Z
- Popularity: 128.6
- Keywords: multimodal model, diffusion model, visual understanding, visual generation, Transformer, UniDDT, unified multimodal model, diffusion transformer
- Page link: https://www.zingnex.cn/en/forum/thread/uniddt-transformer
- Canonical: https://www.zingnex.cn/forum/thread/uniddt-transformer

---

## Introduction

Nanjing University, ByteDance Seed Team, and the University of Hong Kong jointly propose the UniDDT architecture. It combines a Noisy ViT encoder, an LLM backbone, and a decoupled diffusion decoder to support high-quality multimodal understanding and generation in a single, unified visual space. The model reports leading results on GenEval (generation) and MME (understanding), and the code is open source (https://github.com/MCG-NJU/UniDDT).

## Research Background: Existing Unified Multimodal Models Face Three Core Challenges

Unified Multimodal Models (UMMs) must combine visual understanding and visual generation in one model, but existing solutions run into three problems:
1. **Modeling Conflict**: Understanding focuses on high-level semantics, while generation requires fine-grained pixel details. These differences in objective functions and feature representations cause conflicts during joint training;
2. **Fragmented Visual Space**: Understanding uses a high-dimensional semantic space, while generation uses a VAE latent space. Maintaining two spaces adds complexity and makes the model harder to extend;
3. **Insufficient Data Utilization**: Image-text duality is underused, so the same image-text pair is rarely used to train both understanding and generation.

## Core Architectural Innovations: Unified Semantic Extraction and Decoupled Generation Design

UniDDT combines three key components on top of one shared visual representation:
- **Noisy ViT Encoder**: Accepts inputs at any noise level, so one encoder extracts semantics for both understanding (clean images) and generation (noisy latent variables);
- **LLM Backbone Network**: Distinguishes tasks via prompt templates and enables bidirectional semantic interaction between text and vision;
- **Decoupled Diffusion Decoder**: A decoder dedicated to generation, kept separate from text decoding so the two do not interfere;
- **Unified Visual Space**: The VAE latent space is chosen as the shared visual representation to balance understanding and generation performance.
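
The data flow described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the UniDDT implementation: the linear noise blend, the prompt templates, and the names `add_noise`, `build_prompt`, `route`, and `text_head` are all hypothetical. What it shows is that both tasks share the encoder and the backbone, and only the final module differs.

```python
import random


def add_noise(latent, t, rng):
    """Blend a latent with Gaussian noise; t=0 keeps it clean, t=1 is pure noise.

    The linear schedule is an assumption made for this sketch.
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]


def build_prompt(task, text):
    """Hypothetical prompt templates that tell the LLM backbone which task to run."""
    templates = {
        "understand": "<image>\nQuestion: {text}\nAnswer:",
        "generate": "Generate an image: {text}\n<image>",
    }
    return templates[task].format(text=text)


def route(task, latent, text, t=0.0, seed=0):
    """Describe how one sample flows through the three modules."""
    if task not in ("understand", "generate"):
        raise ValueError(f"unknown task: {task}")
    rng = random.Random(seed)
    # Understanding sees the clean latent (noise level 0); generation sees a noisy one.
    noise_level = 0.0 if task == "understand" else t
    # Only the last module is task-specific; "text_head" is a placeholder name.
    head = "text_head" if task == "understand" else "diffusion_decoder"
    return {
        "prompt": build_prompt(task, text),
        "encoder_input": add_noise(latent, noise_level, rng),
        "path": ["noisy_vit", "llm_backbone", head],
    }


understand = route("understand", [0.5, -1.0], "What is shown?")
generate = route("generate", [0.5, -1.0], "a red cube", t=0.7)
print(understand["path"])
print(generate["path"])
```

Treating understanding as the zero-noise case of the same encoder is what lets a single visual space serve both tasks in this sketch.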

## Training Strategy: Three-Stage Progressive Optimization Ensures Stability and Performance

Training proceeds in three phases to avoid model collapse:
1. **Warm-up (Preheating) Phase**: Pretrain the Noisy ViT on understanding data and the diffusion decoder on generation data, each separately;
2. **Joint Training**: Unfreeze all modules and build both understanding and generation samples from paired image-text data, so that the two tasks reinforce each other;
3. **Post-Training Phase**: Fine-tune for specific tasks to improve benchmark performance.
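
One way to picture the schedule is as a table of which modules receive gradients in each phase. The sketch below is an assumption-laden illustration, not the released training configuration: the stage and module names are hypothetical, keeping the LLM backbone frozen during warm-up is inferred from the "unfreeze all modules" wording, and training every module during post-training is a guess.

```python
from dataclasses import dataclass

# Hypothetical module names for the three parts of the model.
MODULES = ("noisy_vit", "llm_backbone", "diffusion_decoder")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple  # modules that receive gradient updates
    data: tuple       # kinds of data used in this stage


SCHEDULE = (
    # Assumption: the LLM backbone stays frozen while the other two warm up.
    Stage("warmup", ("noisy_vit", "diffusion_decoder"), ("understanding", "generation")),
    Stage("joint", MODULES, ("paired_image_text",)),
    # Assumption: post-training keeps every module trainable.
    Stage("post_training", MODULES, ("task_specific",)),
)


def frozen_modules(stage):
    """Return the modules that are NOT updated in the given stage."""
    return tuple(m for m in MODULES if m not in stage.trainable)


for stage in SCHEDULE:
    print(stage.name, "frozen:", frozen_modules(stage))
```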

## Experimental Results: Leading Understanding and Generation Capabilities Validated on Multiple Benchmarks

Reported benchmark results:
- **Generation Tasks**: GenEval overall score 0.87, DPG overall score 86.9;
- **Understanding Tasks**: MME perception score 1699.5, SEED-Bench overall score 76.5.

Conclusion: Unifying the two tasks does not cost performance on either one; instead, understanding and generation reinforce each other.

## Ablation Experiments: Validation of the Effectiveness of Key Design Choices

The ablation experiments show:
1. **Noisy ViT Warm-up**: Skipping the warm-up (preheating) phase and training jointly from the start leads to collapse, while warming up the Noisy ViT first significantly stabilizes optimization;
2. **Decoupled Design**: Compared to full parameter sharing, the decoupled diffusion decoder improves generation quality while maintaining understanding performance;
3. **Dual Data Construction**: Building both understanding and generation samples from the same image-text pairs consistently improves performance.
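
The image-text duality mentioned above can be illustrated with a small helper that turns one captioned image into two training samples, one per task. This is a hypothetical sketch: the function name `make_dual_samples`, the dictionary layout, and the fixed instruction string are assumptions, not the format used by the authors.

```python
def make_dual_samples(image_id, caption):
    """Build one understanding and one generation sample from a single pair."""
    # Understanding direction: the image is the input, the caption is the target.
    understanding = {
        "task": "understand",
        "input": {"image": image_id, "prompt": "Describe the image."},
        "target": {"text": caption},
    }
    # Generation direction: the caption is the input, the image is the target.
    generation = {
        "task": "generate",
        "input": {"prompt": caption},
        "target": {"image": image_id},
    }
    return [understanding, generation]


samples = make_dual_samples("img_0001", "a red cube on a wooden table")
for sample in samples:
    print(sample["task"], "->", sorted(sample["target"]))
```

Each pair is used twice, once in each direction, which is the sense in which the same data serves both objectives.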

## Technical Significance and Future Outlook: Pointing the Way for UMM Development

Technical Significance:
- Challenges the assumption that a model must choose between understanding and generation: a single model delivers both at high quality;
- Noisy ViT offers a new, noise-robust approach to visual representation learning;
- Decoupled unified design: a moderate amount of task-specific optimization works better than full parameter sharing.

Outlook: The open-source implementation provides a strong baseline for the community and drives further development of UMMs.