# VisionFoundry: Teaching Visual Perception to Vision-Language Models via Synthetic Images

> VisionFoundry is a task-aware synthetic data generation pipeline that automatically produces questions, answers, and images from nothing more than a task name. Models trained on the resulting VisionFoundry-10K dataset improve on visual perception benchmarks (7% on MMVP and 10% on CV-Bench-3D).

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-10T17:48:51.000Z
- Last activity: 2026-04-13T03:22:20.239Z
- Popularity: 82.4
- Keywords: vision-language models, synthetic data generation, visual perception, text-to-image generation, visual question answering, data augmentation, multimodal learning
- Page link: https://www.zingnex.cn/en/forum/thread/visionfoundry
- Canonical: https://www.zingnex.cn/forum/thread/visionfoundry

---

## Introduction: Enhancing Visual Perception of Vision-Language Models with Synthetic Data

VisionFoundry is a task-aware synthetic data generation pipeline that automatically generates questions, answers, and images from a task name alone. Models trained on the resulting VisionFoundry-10K dataset improve significantly on visual perception benchmarks, which makes the pipeline a data-driven way to strengthen the perceptual capabilities of Vision-Language Models (VLMs).

## Research Background: Bottlenecks in VLM Visual Perception and Solutions

Vision-Language Models (VLMs) perform strongly across various tasks but still have limitations in visual perception tasks such as spatial understanding and viewpoint recognition. The core reason is that natural image datasets provide limited supervision signals for low-level visual skills, and relevant signals for specific perception tasks are easily overwhelmed by complex scenes. The study raises a key question: Can these weaknesses be addressed through targeted synthetic supervision data? Ideal synthetic data should be generated from task keywords (e.g., "depth ordering") without reference images or manual annotations, providing a scalable and controllable source of training data.

## Methodology: VisionFoundry System Architecture and VisionFoundry-10K Dataset Construction

### VisionFoundry System Architecture
The pipeline's core innovation is that it generates multimodal training data automatically, with only a task name as input. It runs in four steps:
1. A Large Language Model (LLM) generates task-related questions, answers, and Text-to-Image (T2I) prompts;
2. A T2I model (e.g., Stable Diffusion) synthesizes images based on the prompts;
3. A proprietary VLM verifies the consistency between images and question-answer pairs;
4. The pipeline filters out inconsistent samples and retains high-quality image-question-answer triples.
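
The four steps above can be sketched as a small orchestration loop. This is a minimal illustration, not the authors' implementation: the names `Draft`, `Triple`, and `run_pipeline`, and the three callables standing in for the LLM, the T2I model, and the VLM verifier, are all hypothetical. The toy stand-ins at the bottom exist only so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Draft:
    """One LLM-proposed sample before any image exists (hypothetical type)."""
    question: str
    answer: str
    t2i_prompt: str


@dataclass(frozen=True)
class Triple:
    """A verified image-question-answer triple (hypothetical type)."""
    image: bytes
    question: str
    answer: str


def run_pipeline(
    task_name: str,
    propose: Callable[[str], Iterable[Draft]],          # step 1: LLM stand-in
    synthesize: Callable[[str], bytes],                 # step 2: T2I stand-in
    is_consistent: Callable[[bytes, str, str], bool],   # step 3: VLM verifier stand-in
) -> List[Triple]:
    kept: List[Triple] = []
    for draft in propose(task_name):
        image = synthesize(draft.t2i_prompt)
        # Step 4: keep only samples the verifier accepts.
        if is_consistent(image, draft.question, draft.answer):
            kept.append(Triple(image, draft.question, draft.answer))
    return kept


# Toy stand-ins (assumptions for illustration only; no real model is called).
def toy_propose(task: str) -> Iterable[Draft]:
    yield Draft(f"[{task}] Which object is closer?", "the red cube",
                "a red cube in front of a blue sphere")
    yield Draft(f"[{task}] Which object is closer?", "the blue sphere",
                "BAD PROMPT")


def toy_synthesize(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # pretend the prompt bytes are the image


def toy_is_consistent(image: bytes, question: str, answer: str) -> bool:
    return b"BAD" not in image


triples = run_pipeline("depth ordering", toy_propose, toy_synthesize, toy_is_consistent)
```

Passing the three stages in as callables keeps the loop independent of any particular model API, so each stage can be swapped or mocked separately.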

### VisionFoundry-10K Dataset
Built with the pipeline above, VisionFoundry-10K contains 10,000 triples covering 10 visual perception tasks where VLMs underperform (e.g., depth ordering, viewpoint recognition), with approximately 1,000 samples per task. During generation, the LLM handles scene descriptions and question variations, the T2I model generates the visual content, and the VLM validator controls quality.
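
As an illustration of the dataset's shape, the sketch below models one record and checks that samples are spread roughly evenly across tasks. The `Record` fields, the file layout, and the placeholder task names `task_3` through `task_10` are assumptions; the source names only depth ordering and viewpoint recognition, and does not describe the storage format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class Record:
    """One image-question-answer triple plus its task label (hypothetical schema)."""
    task: str
    image_path: str
    question: str
    answer: str


def per_task_counts(records: Iterable[Record]) -> Counter:
    """Count how many records belong to each task."""
    return Counter(r.task for r in records)


def is_roughly_balanced(counts: Mapping[str, int], target: int,
                        tolerance: float = 0.1) -> bool:
    """True if every task is within `tolerance` (a fraction) of `target` samples."""
    return all(abs(n - target) <= tolerance * target for n in counts.values())


# Toy data: 10 tasks with 1,000 placeholder records each.
tasks = ["depth ordering", "viewpoint recognition"] + [f"task_{i}" for i in range(3, 11)]
records = [Record(t, f"{t}/{i:04d}.png", "q", "a") for t in tasks for i in range(1000)]
counts = per_task_counts(records)
balanced = is_roughly_balanced(counts, target=1000)
```

The tolerance parameter reflects the "approximately 1,000 per task" wording: verification filters out some samples, so exact equality across tasks should not be expected.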

## Experimental Evidence: Performance Improvements and Validation of Key Factors

### Key Experimental Results
Models trained on VisionFoundry-10K improve significantly on visual perception benchmarks:
- MMVP (Multimodal Visual Patterns benchmark): 7% performance improvement;
- CV-Bench-3D (3D visual understanding benchmark): 10% performance improvement.

Moreover, performance keeps improving as the amount of training data grows (favorable scaling behavior), in contrast to the diminishing returns observed when training on natural data.

### Ablation Study Analysis
- Task-specific synthetic supervision is key: general synthetic data yields far smaller improvements than task-targeted data;
- Question diversity matters: restricting the types of questions the LLM generates reduces the model's generalization ability;
- The VLM validator is indispensable: removing the validation step lowers both data quality and downstream performance.

## Conclusions and Implications: Value of Synthetic Data for VLM Training

### Conclusions
VisionFoundry produces high-quality training data from task names alone, without reference images or manual annotations. This task-aware synthetic data effectively enhances the visual perception capabilities of VLMs and opens up a new direction for VLM training.

### Implications for VLM Training
1. Natural data provides insufficient supervision signals for specific perceptual skills; synthetic data can supplement these in a targeted manner;
2. Synthetic data is low-cost, highly controllable, and scalable, making it a promising path for VLM training;
3. The combination of LLMs (high-level semantic planning) and T2I models (visual content generation) offers new possibilities for data generation.

## Limitations and Future Research Directions

### Limitations
1. The pipeline relies on a proprietary VLM for consistency verification, so the validator's own limitations may introduce bias;
2. T2I models still struggle to render complex 3D scenes and fine-grained spatial relationships accurately, which limits image quality for those tasks.

### Future Work
- Explore more robust verification mechanisms (e.g., cross-validation with multiple validators);
- Expand to more visual tasks requiring complex reasoning;
- Study hybrid training strategies with real data;
- Optimize the efficiency of synthetic data generation and reduce costs.
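
One possible reading of "cross-validation with multiple validators" is a simple majority vote, sketched below. The source names this only as a future direction and does not specify a mechanism, so the voting rule, the `majority_accepts` function, and the toy validators are all assumptions.

```python
from typing import Callable, Optional, Sequence

# A validator takes (image, question, answer) and returns True if they agree.
Validator = Callable[[bytes, str, str], bool]


def majority_accepts(
    image: bytes,
    question: str,
    answer: str,
    validators: Sequence[Validator],
    min_agree: Optional[int] = None,
) -> bool:
    """Accept a sample if at least `min_agree` validators accept it.

    By default, a strict majority of the validators is required.
    """
    if not validators:
        raise ValueError("at least one validator is required")
    if min_agree is None:
        min_agree = len(validators) // 2 + 1
    votes = sum(1 for validate in validators if validate(image, question, answer))
    return votes >= min_agree


# Toy validators (stand-ins for independent VLMs; no real model is called).
def always_yes(image: bytes, question: str, answer: str) -> bool:
    return True


def always_no(image: bytes, question: str, answer: str) -> bool:
    return False


def answer_in_image(image: bytes, question: str, answer: str) -> bool:
    return answer.encode("utf-8") in image


validators = [always_yes, always_no, answer_in_image]
image = b"a red cube in front of a blue sphere"
accepted = majority_accepts(image, "Which object is closer?", "red cube", validators)
rejected = majority_accepts(image, "Which object is closer?", "green cone", validators)
```

Raising `min_agree` to the number of validators turns the rule into unanimous agreement, trading a lower yield for stricter filtering.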
