Zing Forum


Uni-ViGU: A Unified Framework for Video Generation and Understanding Based on Diffusion Video Generators

This article introduces the Uni-ViGU framework, which unifies video generation and understanding by building on a video diffusion generator as the base architecture, combining a unified flow-matching formulation, a modality-driven MoE design, and a bidirectional training mechanism, and which demonstrates the scalability of the generation-centric architecture.

Tags: video generation · multimodal models · diffusion models · video understanding · unified architecture · flow matching
Published 2026-04-09 19:41 · Recent activity 2026-04-10 10:48 · Estimated read 7 min
Uni-ViGU: A Unified Framework for Video Generation and Understanding Based on Diffusion Video Generators

Section 01

Introduction: Core Innovations and Value of the Uni-ViGU Framework

Uni-ViGU unifies video generation and understanding by using a video diffusion generator as its base architecture, combining a unified flow-matching formulation, a modality-driven MoE design, and a bidirectional training mechanism. It demonstrates the scalability of the generation-centric architecture and resolves the computational dilemmas of traditional understanding-centric designs.


Section 02

Background: Computational Dilemmas of Unified Multimodal Models

Current multimodal models follow fragmented trajectories for visual understanding and generation, and the computational cost of generation far exceeds that of understanding: diffusion generation requires dozens to hundreds of iterative denoising steps, while understanding needs only a single forward pass. Traditional understanding-centric architectures face limitations such as architectural mismatch (information loss when converting discrete tokens to continuous latent spaces), conflicting optimization objectives (difficulty balancing discriminative and generative features), and low computational efficiency (wasted resources when generation capabilities are bolted on).


Section 03

Method: Paradigm Reversal — Using Video Generator as the Architectural Cornerstone

Uni-ViGU reverses the traditional paradigm and uses a video diffusion generator as the basic architecture:

  • Rich generation prior: Diffusion models learn the complete distribution of video data and contain deep visual knowledge;
  • Advantages of continuous representation: Avoids the information bottleneck of discrete tokenization and adapts to high-dimensional video data;
  • Scalable architecture: Based on Transformer/DiT, performance continues to improve as the scale increases.

Section 04

Method: Unified Flow Matching and Modality-Driven MoE Design

Unified Flow Matching

  • Continuous flow matching: The video modality uses standard continuous flow transformation;
  • Discrete flow matching: The text modality is handled with a novel discrete flow transformation;
  • Collaborative generation: A single forward pass processes both video and text generation simultaneously, enabling multimodal joint modeling.
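The joint objective above can be sketched as one training step that combines a standard continuous flow-matching loss on video latents with a mask-based discrete flow loss on text tokens. This is a minimal illustration under stated assumptions: all function and argument names (`flow_matching_step`, `mask_id`, the model returning a velocity and logits from one forward pass) are hypothetical, not the paper's API.

```python
import torch


def flow_matching_step(video_latents, text_tokens, model, vocab_size, mask_id):
    """One joint step: continuous flow matching on video latents plus a
    mask-based discrete flow objective on text tokens. Illustrative sketch."""
    B = video_latents.shape[0]
    t = torch.rand(B)  # one shared timestep per sample

    # Continuous branch: interpolate noise -> data; target is the velocity.
    noise = torch.randn_like(video_latents)
    t_v = t.view(B, *([1] * (video_latents.dim() - 1)))
    x_t = (1 - t_v) * noise + t_v * video_latents
    v_target = video_latents - noise

    # Discrete branch: mask each token with probability (1 - t); the model
    # must recover the original tokens from the corrupted sequence.
    keep = torch.rand_like(text_tokens, dtype=torch.float) <= t.unsqueeze(1)
    corrupted = torch.where(keep, text_tokens,
                            torch.full_like(text_tokens, mask_id))

    # A single forward pass produces both modalities' predictions.
    v_pred, logits = model(x_t, corrupted, t)
    loss_video = torch.mean((v_pred - v_target) ** 2)
    loss_text = torch.nn.functional.cross_entropy(
        logits.view(-1, vocab_size), text_tokens.view(-1))
    return loss_video + loss_text
```

The key design point this illustrates is that both modalities share one timestep and one forward pass, so video and text are denoised jointly rather than by separate heads with separate schedules.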

Modality-Driven MoE

  • Preserve generation core: Video generation parameters and paths remain unchanged;
  • Lightweight text experts: Inject small-parameter text layers;
  • Modality routing: Dynamically activate text layers and allocate resources on demand.
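A minimal sketch of this routing idea, assuming a per-token boolean modality mask: the video FFN path is frozen so generation behavior is preserved, while a smaller text expert is applied only at text positions. Class and parameter names here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ModalityMoEBlock(nn.Module):
    """Modality-driven expert routing sketch: frozen video FFN for video
    tokens, a lightweight trainable text expert for text tokens."""

    def __init__(self, dim, text_hidden):
        super().__init__()
        self.video_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                       nn.Linear(4 * dim, dim))
        # Smaller hidden size: the text expert adds few extra parameters.
        self.text_ffn = nn.Sequential(nn.Linear(dim, text_hidden), nn.GELU(),
                                      nn.Linear(text_hidden, dim))
        for p in self.video_ffn.parameters():
            p.requires_grad = False  # preserve the generation core

    def forward(self, x, is_text):
        # is_text: bool mask (batch, seq); routes each token by modality.
        # Both experts run densely here for simplicity; a real implementation
        # would gather text tokens to avoid the wasted compute.
        out = self.video_ffn(x)
        return torch.where(is_text.unsqueeze(-1), self.text_ffn(x), out)
```

Because the video path's parameters and routing are untouched, the model's pretrained generation behavior on pure-video inputs is exactly preserved.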

Section 05

Method: Bidirectional Training Mechanism — Bridge from Generation to Understanding

Knowledge Recall Phase

  • Prompt reconstruction: Recover the original generation prompts from video latent representations to learn visual-text correspondence;
  • Bidirectional correspondence learning: Establish bidirectional mappings from text to video and video to text.

Capability Refinement Phase

  • Detailed caption fine-tuning: Train with fine-grained captions to generate accurate descriptions;
  • Establish discriminative representations: Share features between generation and understanding to achieve bidirectional capabilities.
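The two phases above can be sketched as one video-to-text objective whose targets change by phase: first the original generation prompts (knowledge recall), then detailed captions (capability refinement). All names (`bidirectional_train_step`, the batch keys, a model that scores target tokens from video latents) are hypothetical placeholders, not the paper's interface.

```python
import torch
import torch.nn.functional as F


def bidirectional_train_step(model, batch, phase):
    """Sketch of the two-phase video -> text objective.
    - 'recall': reconstruct the original generation prompt from video
      latents, establishing the video-to-text direction;
    - 'refine': fine-tune on detailed captions to sharpen descriptions."""
    targets = (batch["prompt_tokens"] if phase == "recall"
               else batch["caption_tokens"])
    # Teacher-forced decoding: predict each target token from video latents.
    logits = model(batch["video_latents"], targets)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
```

Since the same backbone produces latents for generation and consumes them for description, both directions of the text-video mapping are trained on shared features rather than in a separate understanding tower.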

Section 06

Evidence: Verification of Dual Competitiveness in Generation and Understanding

  • Video generation performance: Comparable to or even better than specialized generation models;
  • Video understanding performance: Reaches competitive levels with specialized understanding models on tasks such as question answering and captioning;
  • Scalability: As the model scale increases, both generation and understanding performance continue to improve without optimization dilemmas.

Section 07

Recommendations: Technical Insights and Future Research Directions

  • Paradigm selection: Generation as a basic architecture is more scalable;
  • Value of generation prior: Explore the general applications of diffusion model generation prior;
  • Bidirectional training innovation: Extend to other modality and task combinations.

Section 08

Conclusion: A New Scalable Path for Generation-Centric Architecture

Uni-ViGU achieves dual competitiveness of a single model in video generation and understanding through paradigm reversal (the generator as foundation) combined with three innovations: unified flow matching, modality-driven MoE, and bidirectional training. The generation-centric architecture provides an important design choice for the next generation of unified multimodal systems, and open-sourcing the project will encourage community exploration.