Section 01
PGT Framework: A New Solution to Break the Bottleneck of Fine-Grained Visual Understanding in Multimodal Large Models
Multimodal Large Language Models (MLLMs) have made progress in tasks like image understanding, but they still have shortcomings in fine-grained visual understanding (e.g., spatial relationships, quantitative reasoning). The PGT (Procedurally Generated Tasks) framework enhances the model's fine-grained visual understanding ability through procedurally generated tasks. Experiments show it can improve performance by over 20%, and it can also serve as a diagnostic tool to identify the root causes of perceptual failures.