Zing Forum


DOSE: An Innovative Method for Screening High-Quality Multimodal Data Without Training

DOSE proposes a new method for screening multimodal training data using off-the-shelf pre-trained models (no fine-tuning on target data required). By constructing a joint quality-alignment distribution and adopting an adaptive weighted sampling strategy, this method selects information-rich samples while maintaining long-tail diversity, enabling models to achieve or surpass the performance of those trained with full data on VQA and math benchmarks.

Tags: multimodal learning · data screening · vision-language models · pre-trained models · adaptive sampling · data diversity · training efficiency
Published 2026-04-18 20:41 · Recent activity 2026-04-21 10:18 · Estimated read 5 min


Section 02

Research Background and Challenges

In training Vision-Language Models (VLMs), high-quality, diverse multimodal data is crucial. Existing datasets, however, suffer from noise, redundancy, and poor image-text alignment, all of which reduce learning efficiency and final performance. Traditional data filtering methods require training a dedicated filtering model, which is resource-intensive and creates the paradox of 'training a model just to select training data'.


Section 03

Core Idea of the DOSE Method

The core idea of DOSE (Data Selection via Off-the-shelf Models) is to use off-the-shelf pre-trained models that have never seen the target data to screen samples for training larger multimodal models. Its key insight: even without fine-tuning, such models can effectively evaluate text quality and image-text alignment, overturning the conventional assumption that data screening requires specially trained filtering models.
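As a toy illustration of this idea, the sketch below scores image-text pairs with two stand-in scorers. The function names, heuristics, and weights are illustrative assumptions, not DOSE's actual scorers; in practice the quality score would come from a real pre-trained language model and the alignment score from a pre-trained contrastive image-text model.

```python
# Hypothetical stand-ins for off-the-shelf scorers. DOSE plugs in real
# pre-trained models here; these toy proxies only mimic their interface.

def text_quality_score(caption: str) -> float:
    """Toy proxy: longer, less repetitive captions score higher (in [0, 1])."""
    words = caption.lower().split()
    if not words:
        return 0.0
    diversity = len(set(words)) / len(words)       # penalize repetition
    length_bonus = min(len(words) / 20.0, 1.0)     # reward informative length
    return diversity * length_bonus

def alignment_score(caption: str, image_tags: set) -> float:
    """Toy proxy: fraction of image content tags mentioned in the caption."""
    if not image_tags:
        return 0.0
    words = set(caption.lower().split())
    return len(words & image_tags) / len(image_tags)

def joint_score(caption, image_tags, w_quality=0.5, w_align=0.5):
    # Combine both axes into one sample value (equal weights are an assumption).
    return (w_quality * text_quality_score(caption)
            + w_align * alignment_score(caption, image_tags))

samples = [
    ("a photo", {"dog", "park", "ball"}),
    ("a brown dog chasing a red ball in the park", {"dog", "park", "ball"}),
]
scores = [joint_score(caption, tags) for caption, tags in samples]
```

A detailed, well-aligned caption scores far above a generic one, which is exactly the signal the joint distribution in the next section is built from.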


Section 04

Technical Implementation Path of DOSE

  1. Joint quality-alignment distribution: score each sample on both text quality and image-text alignment to estimate its overall value;
  2. Adaptive weighted sampling: balance the selection of information-rich samples against long-tail diversity, so that rare but valuable samples are retained;
  3. No training required: computational cost drops significantly, and the approach is plug-and-play, scaling to any pre-trained model and task.
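The sampling step above can be sketched as follows. The exponential weighting, temperature, and uniform probability floor are assumptions standing in for the paper's unspecified adaptive scheme; the floor is what keeps long-tail samples from being excluded entirely.

```python
import math
import random

def adaptive_weighted_sample(scores, k, temperature=1.0, floor=0.05, seed=0):
    """Select k sample indices: weight by joint quality-alignment score,
    but mix in a uniform floor so rare long-tail samples keep a nonzero
    chance of selection. (Sketch only; not DOSE's exact scheme.)"""
    rng = random.Random(seed)
    n = len(scores)
    # Temperature-scaled exponential weights favor information-rich samples.
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    # Mix with a uniform floor to preserve long-tail diversity.
    probs = [(1 - floor) * w / total + floor / n for w in weights]

    chosen, candidates = [], list(range(n))
    for _ in range(k):
        # Draw without replacement, proportionally to the remaining probs.
        pool = [probs[i] for i in candidates]
        r = rng.random() * sum(pool)
        acc = 0.0
        for j, idx in enumerate(candidates):
            acc += pool[j]
            if r <= acc:
                chosen.append(candidates.pop(j))
                break
    return chosen

scores = [0.9, 0.8, 0.1, 0.05, 0.85]
picked = adaptive_weighted_sample(scores, k=3)
```

Lowering `temperature` makes selection greedier toward high-scoring samples; raising `floor` pushes it toward uniform coverage, which is the diversity/informativeness trade-off the method tunes.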

Section 05

Experimental Validation and Performance

The effectiveness of DOSE was verified on VQA and math-reasoning benchmarks:

  • Models trained with screened data achieve or surpass the performance of those trained with full data;
  • Data diversity is significantly improved, enhancing model generalization ability;
  • Good efficiency and scalability, suitable for large-scale data processing.

Section 06

Significance and Implications of the DOSE Method

DOSE brings a new perspective to data screening: it shows that the knowledge in pre-trained models can be exploited more fully, and that 'less is more' (carefully selected data can outperform massive raw data). In practice, it reduces data-preparation costs, improves training efficiency, and promotes the data diversity that enhances model robustness.


Section 07

Conclusion and Outlook

DOSE is a notable advance in data-screening technology. It leverages off-the-shelf pre-trained models to select high-quality multimodal data without any additional training, improving diversity while delivering strong performance. In the future, it could be extended to more modality combinations and more complex task scenarios, contributing to the development of multimodal large language models.