
VisionFoundry: Teaching Visual Perception to Vision-Language Models via Synthetic Images

VisionFoundry is a task-aware synthetic data generation pipeline that automatically produces questions, answers, and images with just a task name. The constructed VisionFoundry-10K dataset achieves significant improvements on visual perception benchmarks.

Tags: Vision-Language Models · Synthetic Data Generation · Visual Perception · Text-to-Image Generation · Visual Question Answering · Data Augmentation · Multimodal Learning
Published 2026-04-11 01:48 · Recent activity 2026-04-13 11:22 · Estimated read 7 min

Section 01

Introduction: VisionFoundry, Enhancing Visual Perception of Vision-Language Models with Synthetic Data

VisionFoundry is a task-aware synthetic data generation pipeline that automatically generates questions, answers, and images using only a task name. The constructed VisionFoundry-10K dataset achieves significant improvements on visual perception benchmarks, providing an innovative data-driven approach to enhancing the perceptual capabilities of Vision-Language Models (VLMs).


Section 02

Research Background: Bottlenecks in VLM Visual Perception and Solutions

Vision-Language Models (VLMs) perform strongly across various tasks but still have limitations in visual perception tasks such as spatial understanding and viewpoint recognition. The core reason is that natural image datasets provide limited supervision signals for low-level visual skills, and relevant signals for specific perception tasks are easily overwhelmed by complex scenes. The study raises a key question: Can these weaknesses be addressed through targeted synthetic supervision data? Ideal synthetic data should be generated from task keywords (e.g., "depth ordering") without reference images or manual annotations, providing a scalable and controllable source of training data.


Section 03

Methodology: VisionFoundry System Architecture and VisionFoundry-10K Dataset Construction

VisionFoundry System Architecture

The core innovation of this pipeline is that it automatically generates multimodal training data with only a task name as input, consisting of four steps:

  1. A Large Language Model (LLM) generates task-related questions, answers, and Text-to-Image (T2I) prompts;
  2. A T2I model (e.g., Stable Diffusion) synthesizes images based on the prompts;
  3. A proprietary VLM verifies the consistency between images and question-answer pairs;
  4. Inconsistent samples are filtered out, retaining only high-quality image-question-answer triples.
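The four steps above can be sketched as a single loop. This is a minimal illustration, not the paper's implementation: `llm_generate`, `t2i_render`, and `vlm_is_consistent` are hypothetical stand-ins for the LLM, the T2I model (e.g., Stable Diffusion), and the proprietary VLM validator.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    """One candidate training sample: question, answer, and T2I prompt."""
    question: str
    answer: str
    image_prompt: str

def llm_generate(task_name, n):
    # Step 1 (stub): an LLM would generate task-related questions,
    # answers, and text-to-image prompts from just the task name.
    return [Triple(
        question=f"[{task_name}] question #{i}",
        answer=f"answer #{i}",
        image_prompt=f"scene illustrating {task_name}, variant {i}",
    ) for i in range(n)]

def t2i_render(prompt):
    # Step 2 (stub): a T2I model would synthesize an image from the prompt.
    return f"<image for: {prompt}>"

def vlm_is_consistent(image, triple):
    # Step 3 (stub): a VLM validator would check that the image
    # actually supports the question-answer pair.
    return triple.image_prompt in image

def generate_dataset(task_name, n):
    """Steps 1-4: generate candidates, render images, verify, filter."""
    kept = []
    for triple in llm_generate(task_name, n):    # step 1
        image = t2i_render(triple.image_prompt)  # step 2
        if vlm_is_consistent(image, triple):     # step 3
            kept.append((image, triple))         # step 4: keep consistent only
    return kept

samples = generate_dataset("depth ordering", 3)
```

The key design point the sketch preserves is that the only external input is the task name string; everything downstream is derived from it.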

VisionFoundry-10K Dataset

Built on the above pipeline, VisionFoundry-10K contains 10,000 triples covering 10 visual perception tasks on which VLMs underperform (e.g., depth ordering, viewpoint recognition), with approximately 1,000 samples per task. During generation, the LLM handles scene descriptions and question variations, the T2I model generates the visual content, and the VLM validator controls quality.
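The dataset composition reduces to simple arithmetic, sketched below. Only "depth ordering" and "viewpoint recognition" are named in the text; the remaining eight task names are placeholders.

```python
# Hypothetical tally of the VisionFoundry-10K composition.
named_tasks = ["depth ordering", "viewpoint recognition"]
placeholder_tasks = [f"perception_task_{i}" for i in range(3, 11)]  # assumed names
tasks = named_tasks + placeholder_tasks  # 10 tasks total

SAMPLES_PER_TASK = 1000  # "approximately 1000 samples per task"
dataset_sizes = {task: SAMPLES_PER_TASK for task in tasks}

total = sum(dataset_sizes.values())  # 10 tasks x 1000 samples = 10,000 triples
```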


Section 04

Experimental Evidence: Performance Improvements and Validation of Key Factors

Key Experimental Results

Models trained with VisionFoundry-10K show significant improvements on visual perception benchmarks:

  • MMVP (Multimodal Visual Perception Benchmark): 7% performance improvement;
  • CV-Bench-3D (3D Visual Understanding Benchmark): 10% performance improvement.

Moreover, performance continues to improve as the amount of training data increases (scaling behavior), in contrast to the diminishing returns from training with natural data.

Ablation Study Analysis

  • Task-specific synthetic supervision is key: The improvement from general synthetic data is far less than that from task-targeted data;
  • Question diversity matters: Restricting the types of questions generated by the LLM reduces the model's generalization ability;
  • VLM validator is indispensable: Removing the validation step leads to decreased data quality and performance.
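The question-diversity finding can be made concrete with a toy template pool. The templates and object pairs below are invented for illustration; the point is that a restricted pool collapses every question onto a single phrasing, which is the condition the ablation found to hurt generalization.

```python
import random

random.seed(0)  # fixed seed for reproducibility of this toy example

# Hypothetical question templates for a "depth ordering" task.
full_templates = [
    "Which object is closer to the camera, {a} or {b}?",
    "Is {a} in front of or behind {b}?",
    "Order {a} and {b} from nearest to farthest.",
]
restricted_templates = full_templates[:1]  # ablation: restrict question types

def question_forms(templates, object_pairs):
    """Generate questions and track which distinct templates were used."""
    used, questions = set(), []
    for a, b in object_pairs:
        t = random.choice(templates)
        used.add(t)
        questions.append(t.format(a=a, b=b))
    return questions, used

pairs = [("the cup", "the book"), ("the chair", "the lamp"), ("the car", "the tree")]
_, diverse_used = question_forms(full_templates, pairs)
_, narrow_used = question_forms(restricted_templates, pairs)
```

With the restricted pool, `narrow_used` always contains exactly one template, so the model only ever sees one surface form of the task.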

Section 05

Conclusions and Implications: Value of Synthetic Data for VLM Training

Conclusions

VisionFoundry generates high-quality training data through task-aware synthetic data generation without reference images or manual annotations, effectively enhancing the visual perception capabilities of VLMs and opening up new directions for VLM training.

Implications for VLM Training

  1. Natural data provides insufficient supervision signals for specific perceptual skills; synthetic data can supplement these in a targeted manner;
  2. Synthetic data is low-cost, highly controllable, and scalable, making it a promising path for VLM training;
  3. The combination of LLMs (high-level semantic planning) and T2I models (visual content generation) offers new possibilities for data generation.

Section 06

Limitations and Future Research Directions

Limitations

  1. Relies on a proprietary VLM for consistency verification, which may introduce biases due to the validator's own limitations;
  2. The image quality generated by T2I models still needs improvement in complex 3D scenes and fine-grained spatial relationship expression.

Future Work

  • Explore more robust verification mechanisms (e.g., cross-validation with multiple validators);
  • Expand to more visual tasks requiring complex reasoning;
  • Study hybrid training strategies with real data;
  • Optimize the efficiency of synthetic data generation and reduce costs.
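One of the suggested future directions, cross-validation with multiple validators, could take the form of a simple majority vote over several independent VLM checks. The sketch below is an assumption about how such a scheme might look, using trivial stub validators.

```python
def majority_vote(validators, image, qa_pair):
    """Keep a sample only if a strict majority of validators accept it.

    `validators` is a list of callables (image, qa_pair) -> bool; in practice
    each would be a different VLM, reducing reliance on any single model's biases.
    """
    votes = sum(1 for validate in validators if validate(image, qa_pair))
    return votes * 2 > len(validators)  # strict majority

# Stub validators with fixed behavior, for illustration only.
always_yes = lambda img, qa: True
always_no = lambda img, qa: False

kept = majority_vote([always_yes, always_yes, always_no], None, None)      # 2/3 accept
dropped = majority_vote([always_yes, always_no, always_no], None, None)    # 1/3 accept
```

A strict-majority rule degrades gracefully: a single biased or failing validator cannot by itself keep a bad sample or reject a good one.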