Section 01
Mamoda2.5: Introduction to the Unified Multimodal Understanding and Generation Framework Integrating DiT-MoE
Mamoda2.5 is a unified multimodal model that combines a Diffusion Transformer (DiT) with a fine-grained Mixture-of-Experts (MoE) architecture. Of its 25 billion total parameters, only 3 billion are activated per inference pass, which keeps inference efficient. The model ranks among the top open-source models on video generation and editing tasks, reduces the number of sampling steps from 30 to 4 through distillation and reinforcement learning, and supports both multimodal understanding and generation.
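To make the "3B activated out of 25B total" figure concrete, below is a minimal sketch of how a fine-grained MoE feed-forward layer activates only a small subset of its parameters per token. This is an illustrative example, not Mamoda2.5's published implementation: the class name, the expert count (64), and the top-k value (8) are hypothetical placeholders, since the model's actual routing configuration is not given in this section.

```python
# Illustrative sketch only; module names and hyperparameters are assumptions,
# not Mamoda2.5's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFeedForward(nn.Module):
    """Fine-grained MoE layer: each token is routed to a small top-k subset
    of experts, so only a fraction of the layer's parameters participate
    in any single forward pass."""

    def __init__(self, dim: int, num_experts: int = 64, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim)
        scores = self.router(x)                        # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route to top-k experts
        weights = F.softmax(weights, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            # Only the selected experts run for each token; the rest stay idle.
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out
```

Under this kind of routing, the activated parameter count scales roughly with top_k / num_experts of the expert parameters, which is how a 25B-parameter model can run with only about 3B parameters (roughly 12%) active per pass.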