Section 01
VSTAT Benchmark: Diagnosis of Visual State Tracking Capabilities in Video Understanding for Multimodal Large Models
The VSTAT benchmark, released on arXiv by its authors on June 2, 2026, is designed to diagnose how well multimodal large language models (MLLMs) track visual state in video. The study finds that although MLLMs are strong at text reasoning, their visual perception is not sufficient to track how the state of entities changes over the course of a video, and their performance on VSTAT falls far below human level. The benchmark fills a gap in existing evaluations and is relevant to both research on and applications of MLLM video understanding.
- Original paper link: http://arxiv.org/abs/2606.03920v1
- Release date: June 2, 2026