Section 01
[Introduction] VisionFoundry: Enhancing Visual Perception of Vision-Language Models with Synthetic Data
VisionFoundry is a task-aware synthetic data generation pipeline that automatically generates questions, answers, and images from nothing more than a task name. Training on the resulting VisionFoundry-10K dataset yields significant improvements on visual perception benchmarks, offering a data-driven approach to strengthening the perceptual capabilities of Vision-Language Models (VLMs).