Through systematic comparative experiments, the research team arrived at three key findings:
1. High-level Semantic Tasks Dominate Performance
Across all understanding benchmarks, the high-level proxy task (segmentation) consistently outperforms the mid-level (depth estimation) and low-level (edge detection) tasks. This supports the view that high-level supervision aligns with what perception requires, whereas texture-oriented tasks introduce irrelevant interference.
2. Visual Supervision Enhances Perception but Does Not Affect Reasoning
Generative tuning significantly improves performance on vision-centric tasks, such as spatial reasoning and hallucination resistance, while math and diagram reasoning remain largely unaffected. This indicates that visual supervision improves representation quality but does not give the model additional logical priors.
3. Universal Improvement in Spatial Fidelity
Regardless of semantic granularity, every proxy task improves the spatial fidelity of generation, especially for position-sensitive prompts. Reconstructing visual structures forces the model to learn accurate spatial layouts.