Section 01
[Introduction] Dynamic Scene Understanding in Multimodal Large Language Models: The Dyn-Bench Benchmark and Key Findings
This article introduces a study accepted to CVPR 2026 that proposes Dyn-Bench, a benchmark for systematically evaluating, for the first time, how well multimodal large language models (MLLMs) perceive, track, and reason about spatiotemporal dynamics in the 4D physical world. The study reveals key limitations of current models in dynamic scene understanding and points to directions for improvement.