Section 01
[Introduction] UniDDT: A Novel Decoupled Diffusion Transformer Architecture for Unified Multimodal Understanding and Generation
Nanjing University, the ByteDance Seed Team, and the University of Hong Kong have jointly proposed the UniDDT architecture. It combines a Noisy ViT encoder, an LLM backbone, and a decoupled diffusion decoder to perform high-quality multimodal understanding and generation in a unified visual space. The model achieves leading performance on established benchmarks such as GenEval (generation) and MME (understanding), and the code is open source (https://github.com/MCG-NJU/UniDDT).