Section 01
[Introduction] VidGround: Core Ideas of a Data Filtering Scheme Focused on Visual Grounding
The video understanding capability of Vision-Language Models (VLMs) has long lagged behind their text reasoning capability. Studies have found that 40-60% of the questions in mainstream video-understanding benchmarks and post-training datasets can be answered from text clues alone, so models trained on them never truly learn to understand video. By post-training on a filtered subset that genuinely requires visual grounding, combined with a reinforcement-learning post-training algorithm, VidGround improves model performance by 6.2 percentage points while using only 69% of the data, confirming that data quality matters more than quantity.
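The filtering idea implied above — discarding questions that a text-only ("blind") model can already answer without seeing the video — can be sketched as follows. This is a minimal illustration, not VidGround's actual pipeline: the function and field names (`filter_visually_grounded`, `question`, `answer`) and the exact keep/drop criterion are assumptions for the sake of the example.

```python
from typing import Callable


def filter_visually_grounded(
    examples: list[dict],
    blind_answer: Callable[[str], str],
    n_samples: int = 3,
) -> list[dict]:
    """Keep only examples a text-only ("blind") model cannot answer.

    An example survives the filter if the blind model, given the question
    text alone (no video frames), fails to produce the ground-truth answer
    in all of `n_samples` attempts. Such questions plausibly require
    visual grounding; the rest are answerable from text clues alone.
    """
    kept = []
    for ex in examples:
        attempts = [blind_answer(ex["question"]) for _ in range(n_samples)]
        if all(a != ex["answer"] for a in attempts):
            kept.append(ex)  # blind model always fails: visual grounding needed
    return kept


# Toy demonstration with a fake "blind" model that can do arithmetic
# but cannot see the video content.
data = [
    {"question": "What color is the balloon in the clip?", "answer": "red"},
    {"question": "What is 2 + 2?", "answer": "4"},
]
blind = lambda q: "4" if "2 + 2" in q else "unknown"
filtered = filter_visually_grounded(data, blind)
```

Here the arithmetic question is dropped (the blind model answers it correctly), while the balloon question survives because answering it requires watching the video. In practice the blind model would be a strong LLM queried multiple times, which is why the sketch samples `n_samples` attempts rather than one.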