UniDDT: A Novel Decoupled Diffusion Transformer Architecture for Unified Multimodal Understanding and Generation

Nanjing University and ByteDance Seed Team jointly propose UniDDT, which achieves high-quality multimodal understanding and generation simultaneously in a unified visual space through a Noisy ViT encoder and a decoupled diffusion decoder, and delivers leading performance on benchmarks such as GenEval and MME.

Tags: multimodal models, diffusion models, visual understanding, visual generation, Transformer, UniDDT, unified multimodal model, diffusion transformer

Published 2026-06-15 13:57 | Recent activity 2026-06-16 12:20 | Estimated read 6 min

Section 01

[Introduction] UniDDT: A Novel Decoupled Diffusion Transformer Architecture for Unified Multimodal Understanding and Generation

Nanjing University, ByteDance Seed Team, and the University of Hong Kong jointly propose the UniDDT architecture. It achieves high-quality multimodal understanding and generation in a unified visual space by combining a Noisy ViT encoder, an LLM backbone network, and a decoupled diffusion decoder. The model reports leading results on GenEval (generation) and MME (understanding), and the code is open source (https://github.com/MCG-NJU/UniDDT).

Section 02

Research Background: Existing Unified Multimodal Models Face Three Core Challenges

Unified Multimodal Models (UMM) need to integrate visual understanding and generation capabilities, but existing solutions have the following problems:

Modeling Conflict: Understanding focuses on high-level semantics, while generation requires fine-grained pixel details. Differences in objective functions and feature representations lead to conflicts in joint training;
Fragmented Visual Space: Understanding uses a high-dimensional semantic space, while generation uses a VAE latent space, increasing complexity and hindering expansion;
Insufficient Data Utilization: The image-text duality is not fully utilized, and the same data is not used for both understanding and generation training.

Section 03

Core Architectural Innovations: Unified Semantic Extraction and Decoupled Generation Design

UniDDT combines three key components with one choice of visual representation:

Noisy ViT Encoder: Processes noisy inputs and unifies semantic encoding for understanding (clean images) and generation (noisy latent variables);
LLM Backbone Network: Distinguishes tasks via prompt templates and enables bidirectional semantic interaction between text and vision;
Decoupled Diffusion Decoder: Optimized specifically for generation tasks to avoid interference with text decoding;
Unified Visual Representation: The VAE latent space is chosen as the shared visual space to balance understanding and generation performance.
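The way one encoder can serve both tasks can be illustrated with a small sketch. It is not taken from the UniDDT code: the prompt templates, the names `build_encoder_input` and `add_noise`, and the linear interpolation between latent and Gaussian noise are all assumptions made for illustration. The only points drawn from the article are that tasks are distinguished by prompt template, that understanding sees a clean image, and that generation sees a noisy latent.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical prompt templates. The article only says that tasks are
# distinguished via prompt templates, not what the templates contain.
TEMPLATES = {
    "understand": "<image>\nQuestion: {text}\nAnswer:",
    "generate": "Generate an image of: {text}\n<image>",
}


@dataclass
class EncoderInput:
    task: str
    prompt: str
    latent: List[float]  # VAE latent passed to the Noisy ViT encoder
    noise_level: float   # 0.0 means a clean image


def add_noise(latent: List[float], t: float, rng: random.Random) -> List[float]:
    """Blend the latent with Gaussian noise (assumed linear interpolation)."""
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]


def build_encoder_input(task: str, text: str, latent: List[float],
                        t: float = 0.0, seed: int = 0) -> EncoderInput:
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    # Understanding always receives the clean latent; generation receives
    # the latent noised to level t.
    level = 0.0 if task == "understand" else t
    noisy = add_noise(latent, level, random.Random(seed))
    return EncoderInput(task, TEMPLATES[task].format(text=text), noisy, level)
```

In this sketch both tasks go through the same input type, so a single encoder can consume either; only the noise level and the prompt differ.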

Section 04

Training Strategy: Three-Stage Progressive Optimization Ensures Stability and Performance

A phased training approach is adopted to avoid model collapse:

Warm-up Phase: Pretrain the Noisy ViT (on understanding data) and the diffusion decoder (on generation data) separately;
Joint Training: Unfreeze all modules, build both understanding and generation samples from paired image-text data, and let the two tasks reinforce each other;
Post-Training Phase: Fine-tune for specific tasks to improve benchmark performance.
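The three phases can be written down as a schedule of which modules train on which data. The sketch below is illustrative only: the module and stage names are hypothetical, the warm-up phase is split into two entries because the encoder and decoder are pretrained separately, and two details are assumptions not stated in the article (the LLM stays frozen during warm-up, and all modules stay trainable in post-training).

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical module names for the three components named in the article.
MODULES = ("noisy_vit", "llm", "diffusion_decoder")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: Tuple[str, ...]  # modules that receive gradient updates
    data: Tuple[str, ...]       # kinds of training data used


SCHEDULE = (
    # Warm-up: encoder and decoder are pretrained separately.
    # Assumption: the LLM backbone is frozen here.
    Stage("warmup_encoder", ("noisy_vit",), ("understanding",)),
    Stage("warmup_decoder", ("diffusion_decoder",), ("generation",)),
    # Joint training: all modules are unfrozen.
    Stage("joint", MODULES, ("understanding", "generation")),
    # Post-training: task-specific fine-tuning.
    # Assumption: every module remains trainable.
    Stage("post_training", MODULES, ("task_specific",)),
)


def frozen_modules(stage: Stage) -> Tuple[str, ...]:
    """Return the modules that should be frozen during the given stage."""
    return tuple(m for m in MODULES if m not in stage.trainable)
```

A training loop could call `frozen_modules(stage)` at the start of each stage to decide which parameters to exclude from the optimizer.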

Section 05

Experimental Results: Leading Understanding and Generation Capabilities Validated on Multiple Benchmarks

Performance on authoritative benchmarks:

Generation Tasks: GenEval overall score 0.87, DPG overall score 86.9;
Understanding Tasks: MME perception score 1699.5, SEED-Bench overall score 76.5.

Conclusion: Neither task loses performance from unification; instead, the two reinforce each other.

Section 06

Ablation Experiments: Validation of the Effectiveness of Key Design Choices

Ablation experiments show:

Noisy ViT Warm-up: Direct joint training leads to collapse, while warm-up significantly stabilizes optimization;
Decoupled Design: Compared to full parameter sharing, the decoupled diffusion decoder improves generation quality while maintaining understanding performance;
Dual Data Construction: Building both understanding and generation samples from the same image-text pairs consistently improves performance.
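Image-text duality means that one pair can be read in two directions. A minimal sketch of that idea follows; the function name `make_dual_samples` and the dictionary field names are hypothetical and do not come from the UniDDT repository.

```python
from typing import Dict, List


def make_dual_samples(image_id: str, caption: str) -> List[Dict[str, str]]:
    """Turn one image-text pair into an understanding and a generation sample."""
    # Understanding direction: the image is the input, the text is the target.
    understanding = {
        "task": "understand",
        "input_image": image_id,
        "target_text": caption,
    }
    # Generation direction: the text is the input, the image is the target.
    generation = {
        "task": "generate",
        "input_text": caption,
        "target_image": image_id,
    }
    return [understanding, generation]
```

Each pair therefore contributes to both objectives, which is the property the ablation credits with the consistent improvement.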

Section 07

Technical Significance and Future Outlook: Pointing the Way for UMM Development

Technical Significance:

Challenges the assumption that a model must choose between understanding and generation: a single model achieves both at high quality;
Noisy ViT offers a new, noise-robust approach to visual representation learning;
Decoupled unified design: Moderate task-specific optimization works better than full parameter sharing.

Outlook: The open-source implementation provides a strong baseline for the community and promotes further development of UMMs.
