Zing Forum

Reading

VidGround: The Approach to Data Filtering for Visually Grounded Post-Training

Studies have found that 40-60% of questions in mainstream video understanding benchmarks can be answered using only text clues. VidGround improves model performance by 6.2 points using just 69% of the data through post-training on filtered data that truly requires visual grounding.

Vision-Language Models · Video Understanding · Post-Training · Data Filtering · Visual Grounding · Reinforcement Learning · Data Quality
Published 2026-04-07 03:22 · Recent activity 2026-04-08 11:18 · Estimated read 6 min

Section 01

[Introduction] VidGround: Core Points of the Data Filtering Scheme Focused on Visual Grounding

The video understanding capability of Vision-Language Models (VLMs) has long lagged behind their text reasoning capability. Studies have found that 40-60% of the questions in mainstream video understanding benchmarks and post-training datasets can be answered using only text clues, which makes it hard for models to truly learn video understanding. By post-training on filtered data that genuinely requires visual grounding, combined with a reinforcement learning post-training algorithm, VidGround improves model performance by 6.2 percentage points while using only 69% of the data, demonstrating that data quality matters more than quantity.


Section 02

Background: Hidden Biases in Video Understanding Benchmarks and Post-Training Data

The current VLM evaluation system carries serious hidden biases: 40-60% of the questions in mainstream long-video understanding benchmarks are "text-solvable", meaning they can be answered without watching the video. This not only leads to overestimation of model capabilities but also misleads optimization directions. More importantly, the same bias is prevalent in widely used post-training datasets, so models learn to rely on text clues rather than video understanding, and this has become a core bottleneck for improving VLM video understanding.


Section 03

VidGround Core Strategy: Filtering Visually Grounded Data

The core idea of VidGround is to eliminate text-solvable samples from post-training data and retain only questions that require visual grounding. Implementation is divided into two steps: 1) Identify "visually grounded" (dependent on video content) and "text-solvable" samples in the dataset through automated or manual methods; 2) Use only the former for post-training. This strategy is concise and efficient, requiring no complex algorithms or additional resources.
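The two steps above can be sketched as a simple filtering pass. This is a minimal illustration, not the paper's actual implementation: the `text_only_answer` callable is a hypothetical stand-in for whatever automated check is used (e.g., querying a blind LLM with the question and options but no video frames), and all names here are assumptions.

```python
# Hedged sketch of VidGround-style data filtering: drop samples a
# text-only answerer gets right, keep those that need the video.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QASample:
    question: str
    options: List[str]
    answer: str       # ground-truth option
    video_id: str

def is_text_solvable(sample: QASample,
                     text_only_answer: Callable[[str, List[str]], str]) -> bool:
    """Step 1: a sample is 'text-solvable' if a model that never sees
    the video still picks the correct option from the text alone."""
    return text_only_answer(sample.question, sample.options) == sample.answer

def filter_visually_grounded(samples: List[QASample],
                             text_only_answer: Callable[[str, List[str]], str]
                             ) -> List[QASample]:
    """Step 2: retain only samples that require visual grounding."""
    return [s for s in samples if not is_text_solvable(s, text_only_answer)]

# Toy 'blind' answerer for illustration only: always guesses option 0.
blind_guess = lambda question, options: options[0]

pool = [
    QASample("What color is the car?", ["red", "blue"], "red", "v1"),       # guessable
    QASample("What does the man do next?", ["runs", "sits"], "sits", "v2"), # needs video
]
grounded = filter_visually_grounded(pool, blind_guess)
```

In practice the text-only check would be run with a strong language model (possibly over multiple sampled answers to reduce lucky guesses), but the overall shape of the pipeline stays this simple: classify, then keep only the visually grounded subset for post-training.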


Section 04

Experimental Evidence: Dual Improvement in Data Efficiency and Performance

Experimental results show that when VidGround is combined with a reinforcement learning post-training algorithm, model performance improves by 6.2 percentage points while using only 69.1% of the original data. In addition, simple post-training using filtered data outperforms multiple complex post-training techniques using complete data, verifying the hypothesis that data quality is more important than quantity and providing a practical path for resource-constrained scenarios.


Section 05

Implications for VLM Development

The results of VidGround bring three implications: 1) Evaluation benchmarks need to be more rigorous to ensure testing of true visual understanding capabilities; 2) Data curation should become a standard part of the training process, prioritizing high-quality data over scale; 3) Improving video understanding needs to start from the data source and extend to fine-grained temporal reasoning tasks.


Section 06

Practical Applications and Future Directions

VidGround is practical and scalable: the research team provides a project page (http://vidground.etuagi.com) for easy reproduction, and practitioners can improve models by upgrading their data filtering pipelines without redesigning architectures. Future directions include extending to multimodality (e.g., joint audio-video understanding), developing automated visual grounding recognition algorithms, exploring fine-grained filtering strategies (global vs. local video understanding), and applying the approach to the pre-training phase.


Section 07

Conclusion: Data Quality is the Key to Video Understanding

VidGround reveals the decisive role of data quality, and especially the degree of visual grounding, in the true capabilities of models. Through simple data filtering, it not only improves performance but also ensures that models learn genuine video understanding rather than text shortcuts. In the pursuit of larger models, the foundation of data quality should not be ignored: as VidGround shows, "less is more".