Section 01
MMPhysVideo: Guide to Enhancing Physical Plausibility of Video Generation via Joint Multimodal Modeling
MMPhysVideo addresses physical inconsistency in Video Diffusion Models (VDMs) through joint multimodal modeling. It unifies three kinds of perceptual cues (semantic, geometric, and spatiotemporal trajectory) into a pseudo-RGB format, uses a bidirectional control teacher architecture to decouple RGB processing from perceptual processing, and relies on knowledge distillation for efficient inference. Across multiple benchmarks, the method improves both the physical plausibility and the visual quality of generated video, offering a new paradigm for the physical consistency problem in video generation.
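To make the pseudo-RGB idea concrete, here is a minimal sketch of how a geometric cue (a depth map) and a semantic cue (a label map) could each be rendered as a three-channel image that an RGB video backbone can consume. This is an illustration only: the summary does not specify MMPhysVideo's actual encoding, so the function names `depth_to_pseudo_rgb` and `labels_to_pseudo_rgb`, the min-max normalization, and the color palette are all assumptions.

```python
import numpy as np


def depth_to_pseudo_rgb(depth: np.ndarray) -> np.ndarray:
    """Min-max normalize an (H, W) depth map and replicate it to 3 channels.

    Hypothetical helper: the normalization scheme is an assumption, not the
    encoding used by MMPhysVideo.
    """
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    # Guard against a constant map, which would otherwise divide by zero.
    norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    gray = np.round(norm * 255).astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)


def labels_to_pseudo_rgb(labels: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Map an (H, W) integer label map to colors via a (K, 3) uint8 palette.

    Hypothetical helper: the palette is chosen by the caller.
    """
    # Integer array indexing looks up one palette row per pixel.
    return palette[labels]


# Toy inputs standing in for one video frame's perceptual cues.
depth = np.array([[0.0, 1.0], [2.0, 4.0]])
labels = np.array([[0, 1], [1, 0]])
palette = np.array([[255, 0, 0], [0, 255, 0]], dtype=np.uint8)

depth_rgb = depth_to_pseudo_rgb(depth)
label_rgb = labels_to_pseudo_rgb(labels, palette)
```

Both outputs have the shape `(H, W, 3)` and dtype `uint8`, the same layout as an ordinary RGB frame. That shared layout is the point of a pseudo-RGB representation: different cue types can be handled by the same image or video machinery without a dedicated input branch for each modality.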