# DOSE: An Innovative Method for Screening High-Quality Multimodal Data Without Training

> DOSE proposes a new method for screening multimodal training data using off-the-shelf pre-trained models (no fine-tuning on target data required). By constructing a joint quality-alignment distribution and adopting an adaptive weighted sampling strategy, this method selects information-rich samples while maintaining long-tail diversity, enabling models to achieve or surpass the performance of those trained with full data on VQA and math benchmarks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-18T12:41:06.000Z
- Last activity: 2026-04-21T02:18:36.764Z
- Popularity: 87.4
- Keywords: multimodal learning, data selection, vision-language models, pre-trained models, adaptive sampling, data diversity, training efficiency
- Page link: https://www.zingnex.cn/en/forum/thread/dose
- Canonical: https://www.zingnex.cn/forum/thread/dose
- Markdown source: floors_fallback

---


## Research Background and Challenges

Training Vision-Language Models (VLMs) depends on high-quality, diverse multimodal data. However, existing datasets suffer from noise, redundancy, and weak image-text alignment, all of which reduce learning efficiency and final performance. Traditional filtering approaches require training a dedicated filtering model, which consumes substantial resources and creates the paradox of 'training a model just to screen training data'.

## Core Idea of the DOSE Method

The core idea of DOSE (Data Selection via Off-the-shelf Models) is to use off-the-shelf pre-trained models that have never seen the target data to screen samples for larger multimodal models. The key insight: even without fine-tuning, off-the-shelf pre-trained models can effectively evaluate text quality and image-text alignment, overturning the conventional assumption that data screening requires specially trained filter models.
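The thread does not specify which scorers DOSE uses, so the sketch below stands in for the idea with toy proxies: a hypothetical `text_quality` function (a real pipeline might use negative perplexity from a frozen language model) and a CLIP-style cosine similarity over frozen embeddings for alignment. Both function names and their internals are illustrative assumptions, not the paper's implementation.

```python
import math

def text_quality(caption: str) -> float:
    """Toy proxy for an LM-based text-quality score (higher = better).
    A real scorer would query a frozen language model; here we reward
    lexical diversity and sufficient length as a stand-in."""
    words = caption.lower().split()
    if not words:
        return 0.0
    diversity = len(set(words)) / len(words)    # penalize repetition
    length_term = math.tanh(len(words) / 10.0)  # reward enough detail
    return diversity * length_term

def alignment_score(image_emb, text_emb) -> float:
    """Cosine similarity between frozen image and text embeddings,
    a CLIP-style stand-in for image-text alignment."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    na = math.sqrt(sum(a * a for a in image_emb))
    nb = math.sqrt(sum(b * b for b in text_emb))
    return dot / (na * nb) if na and nb else 0.0
```

Because both scorers come from frozen, off-the-shelf models, no gradient step ever touches the target data, which is what makes the screening training-free.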

## Technical Implementation Path of DOSE

1. **Joint Quality-Alignment Distribution**: Score each sample on both text quality and image-text alignment, combining them into a single estimate of sample value;
2. **Adaptive Weighted Sampling**: Balance the selection of information-rich samples against long-tail diversity, so that rare but valuable samples are still included;
3. **No Training Required**: The pipeline is plug-and-play, significantly reduces computational cost, and scales to any pre-trained model and task.
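The steps above can be sketched as a single selection routine. The combination rule (a product of the two scores) and the temperature-softened sampling without replacement are assumptions for illustration; DOSE's exact joint distribution and weighting may differ. The point is the trade-off: a low temperature approaches greedy top-k selection, while a higher temperature keeps long-tail samples in play.

```python
import math
import random

def dose_select(quality, alignment, k, temperature=1.0, seed=0):
    """Sketch of joint-score weighted sampling without replacement.

    quality, alignment: per-sample scores from frozen off-the-shelf models.
    k: number of samples to select.
    temperature: T -> 0 approximates greedy top-k (information-rich focus);
                 larger T flattens weights, preserving long-tail diversity.
    """
    # Step 1: joint quality-alignment score (product rule, assumed here)
    joint = [q * a for q, a in zip(quality, alignment)]
    # Step 2: temperature-controlled weights (softmax numerator, shifted
    # by the max score for numerical stability)
    m = max(joint)
    weights = [math.exp((s - m) / temperature) for s in joint]
    rng = random.Random(seed)
    chosen, pool = [], list(range(len(joint)))
    for _ in range(min(k, len(pool))):
        total = sum(weights[i] for i in pool)
        r = rng.random() * total
        acc = 0.0
        for pos, i in enumerate(pool):
            acc += weights[i]
            if acc >= r:           # roulette-wheel draw
                chosen.append(i)
                pool.pop(pos)      # without replacement
                break
    return chosen
```

With `temperature=0.05` the routine behaves almost greedily, while `temperature=5.0` on the same scores spreads probability mass toward lower-scoring, rarer samples.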

## Experimental Validation and Performance

The effectiveness of DOSE was verified on VQA and math reasoning benchmarks:
- Models trained with screened data achieve or surpass the performance of those trained with full data;
- Data diversity is significantly improved, enhancing model generalization ability;
- The method is efficient and scalable, making it suitable for large-scale data processing.

## Significance and Implications of the DOSE Method

DOSE brings a new perspective to data screening: it shows that the knowledge embedded in pre-trained models can be exploited more fully, and that 'less is more', with carefully selected data outperforming massive raw data. In practice, it reduces data preparation costs, improves training efficiency, and promotes data diversity for better robustness.

## Conclusion and Outlook

DOSE is an important advance in data screening technology. It leverages off-the-shelf pre-trained models to select high-quality multimodal data without any additional training, preserving diversity while matching or exceeding full-data performance. In the future, it could extend to more modality combinations and more complex task scenarios, contributing to the development of multimodal large language models.
