Zing Forum

PGT: Breaking the Bottleneck of Fine-Grained Visual Understanding in Multimodal Large Models with Procedurally Generated Tasks

This article introduces the PGT (Procedurally Generated Tasks) framework, which strengthens the fine-grained visual understanding of multimodal large language models by training them on procedurally generated tasks. Reported experiments show gains of about 20% on the What'sUp benchmark.

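Multimodal large language models, Visual understanding, Fine-grained perception, Data augmentation, Spatial reasoning, MLLM, Computer vision, Deep learning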
Published 2026-05-23 01:45 | Recent activity 2026-05-25 12:17 | Estimated read 6 min

Section 01

PGT Framework: A New Solution to Break the Bottleneck of Fine-Grained Visual Understanding in Multimodal Large Models

Multimodal Large Language Models (MLLMs) have made progress on tasks such as image understanding, but they remain weak at fine-grained visual understanding, for example spatial relationships and quantitative reasoning. The PGT (Procedurally Generated Tasks) framework targets this gap by training models on procedurally generated tasks. Reported experiments show gains of about 20% on the What'sUp benchmark, and PGT can also serve as a diagnostic tool for identifying the root causes of perceptual failures.

Section 02

Background: Challenges and Core Issues in Fine-Grained Visual Understanding

Current MLLMs perform poorly on fine-grained tasks such as spatial relationships, quantitative reasoning, and 3D depth understanding. For example, they struggle to answer "Is the cat on the left larger than the cat on the right?". The traditional view attributes this to architectural limitations or insufficient resolution. The PGT research argues instead that the core issue is insufficient supervision: models do not see enough fine-grained training data to learn precise visual localization.

Section 03

Methodology: Core Ideas and Technical Implementation of the PGT Framework

Core innovation of PGT: generate dense supervision signals by overlaying geometric primitives (rectangles, circles, etc.) on images. This serves three functions:

  • Decoupling visual localization from semantic priors;
  • Low-cost data augmentation;
  • A diagnostic tool.

Technical implementation: PGT data is mixed with the LLaVA-v1.5-Instruct dataset for instruction fine-tuning, covering tasks such as spatial relationship understanding, quantitative reasoning, and 3D/depth perception. PGT does not change the model architecture and adds no inference overhead; it is purely a data augmentation method.
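To make the idea of overlaying primitives concrete, here is a minimal sketch of how such a task might be generated. It is not the PGT authors' code: the `Primitive` class, the function names, the color list, the question templates, and the use of a 2D grid of labels in place of a real image are all assumptions chosen for illustration. The key point it shows is that, because the program places the shapes itself, the answers come for free from the shape parameters.

```python
import random
from dataclasses import dataclass

# Hypothetical palette; distinct colors keep the questions unambiguous.
COLORS = ["red", "green", "blue", "yellow"]


@dataclass
class Primitive:
    shape: str  # "rectangle" or "circle"
    color: str
    cx: int     # center x in pixels
    cy: int     # center y in pixels
    size: int   # half-width (rectangle) or radius (circle) in pixels


def sample_primitives(rng, width, height, n):
    """Sample n primitives with distinct colors that fit inside the image."""
    if not 2 <= n <= len(COLORS):
        raise ValueError("n must be between 2 and the number of colors")
    primitives = []
    for color in rng.sample(COLORS, n):
        size = rng.randint(4, 10)
        primitives.append(Primitive(
            shape=rng.choice(["rectangle", "circle"]),
            color=color,
            cx=rng.randint(size, width - size - 1),
            cy=rng.randint(size, height - size - 1),
            size=size,
        ))
    return primitives


def overlay(image, p):
    """Paint primitive p onto a 2D grid of pixel labels, in place."""
    for y in range(max(0, p.cy - p.size), min(len(image), p.cy + p.size + 1)):
        for x in range(max(0, p.cx - p.size), min(len(image[0]), p.cx + p.size + 1)):
            inside = (p.shape == "rectangle"
                      or (x - p.cx) ** 2 + (y - p.cy) ** 2 <= p.size ** 2)
            if inside:
                image[y][x] = p.color


def make_tasks(primitives):
    """Derive question/answer pairs directly from the known shape parameters."""
    a, b = primitives[0], primitives[1]
    n_circles = sum(p.shape == "circle" for p in primitives)
    return [
        {   # spatial relationship task
            "question": f"Is the {a.color} {a.shape} to the left of the {b.color} {b.shape}?",
            "answer": "yes" if a.cx < b.cx else "no",
        },
        {   # counting task
            "question": "How many circles are overlaid on the image?",
            "answer": str(n_circles),
        },
    ]


rng = random.Random(0)  # fixed seed so the generated tasks are reproducible
primitives = sample_primitives(rng, width=64, height=64, n=3)
image = [["background"] * 64 for _ in range(64)]  # stand-in for a real photo
for p in primitives:
    overlay(image, p)
for task in make_tasks(primitives):
    print(task["question"], "->", task["answer"])
```

In a real pipeline the grid would be replaced by an actual image and a drawing library, but the labeling logic would stay the same: no human annotation is needed because every answer is computed from the sampled coordinates.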

Section 04

Evidence: Experimental Validation of PGT's Effectiveness

Experimental results demonstrate the effectiveness of PGT:

  • Base model (LLaVA-v1.5-Instruct + PGT): +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities;
  • Advanced model fine-tuning: +5.5% improvement on What'sUp, +8.3% on CV-Bench-2D. Even top models can benefit from PGT's fine-grained supervision.

Section 05

Conclusion: The Key Role of Supervision Signals and the Value of PGT

Key finding from the PGT research: many spatial reasoning defects stem from insufficient supervision signals, not from architectural or resolution limitations. Practical implications:

  • Prioritize data engineering: first check whether the training data provides sufficient supervision;
  • Low-cost improvement: no architectural changes are needed;
  • Scalability: the data is procedurally generated, so it is not limited by manual annotation costs.
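One way to act on the "check the supervision first" advice, and to use procedurally generated tasks diagnostically, is to break evaluation accuracy down by task type. The sketch below is an assumption about how that could look, not a procedure taken from the PGT work: the record fields (`task`, `answer`, `prediction`) and the exact-match scoring are hypothetical.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return exact-match accuracy per task type.

    Each record is a dict with hypothetical keys:
    "task" (category name), "answer" (ground truth), "prediction" (model output).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["task"]] += 1
        # Case- and whitespace-insensitive exact match (a simplifying assumption).
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["task"]] += 1
    return {task: correct[task] / totals[task] for task in totals}


# Made-up records for illustration only; these are not results from the article.
example_records = [
    {"task": "spatial", "answer": "yes", "prediction": "Yes"},
    {"task": "spatial", "answer": "no", "prediction": "yes"},
    {"task": "counting", "answer": "2", "prediction": "2"},
]
for task, acc in accuracy_by_task(example_records).items():
    print(f"{task}: {acc:.2f}")
```

A category that scores far below the others is a candidate for more generated training tasks of that type, before any architectural change is considered.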

Section 06

Implications: Practical Path for Multimodal AI Development

PGT supports a familiar machine learning principle: how a problem is formulated can matter more than how it is solved. Recasting fine-grained visual understanding as a geometric primitive recognition task yields clear supervision signals. For engineers and researchers, the implication is that adding PGT data to an existing training pipeline can significantly improve model performance on tasks such as spatial reasoning and quantitative comparison.
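Adding generated data to an existing pipeline usually comes down to mixing two sample lists before fine-tuning. The following is a minimal sketch under stated assumptions: the function name, the `pgt_fraction` parameter, and the 20% value in the usage example are hypothetical, and the article does not say what mixing ratio the PGT authors used.

```python
import random


def mix_datasets(instruct_samples, pgt_samples, pgt_fraction, seed=0):
    """Return a shuffled list in which about pgt_fraction of the samples are PGT.

    All instruction samples are kept; PGT samples are drawn without replacement.
    pgt_fraction is a hypothetical knob, not a value reported for PGT.
    """
    if not 0.0 <= pgt_fraction < 1.0:
        raise ValueError("pgt_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_pgt / (n_instruct + n_pgt) = pgt_fraction for n_pgt.
    n_pgt = round(len(instruct_samples) * pgt_fraction / (1.0 - pgt_fraction))
    n_pgt = min(n_pgt, len(pgt_samples))  # cannot draw more than is available
    mixed = list(instruct_samples) + rng.sample(list(pgt_samples), n_pgt)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the instruction dataset and the generated tasks.
instruct = [{"source": "instruct", "id": i} for i in range(80)]
pgt = [{"source": "pgt", "id": i} for i in range(100)]
mixed = mix_datasets(instruct, pgt, pgt_fraction=0.2)
print(len(mixed), sum(s["source"] == "pgt" for s in mixed))
```

Keeping every original instruction sample and only adding generated ones is one reasonable design choice if the goal is to preserve general capabilities; the right fraction would need to be tuned empirically.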

Section 07

Epilogue: The Simplicity and Elegance of PGT and Its Future Impact

PGT addresses a complex technical problem in a simple way, a reminder that effective solutions may lie in better data rather than in more complex models. As MLLMs move into real-world applications, fine-grained visual understanding becomes key to their practical usefulness, and PGT offers a low-cost, efficient way to improve it.