Section 01
CrossMath: Do Vision-Language Models Truly Possess Visual Reasoning Capabilities?
Core Insights Summary
CrossMath is a new multimodal reasoning benchmark proposed by a team from Nanyang Technological University, Singapore, designed to systematically study the cross-modal reasoning gap in Vision-Language Models (VLMs). By presenting the same problems in text and in visual form under controlled conditions, it exposes a fundamental difference between the two modalities in reasoning tasks: VLMs answer significantly less accurately when a problem is shown as an image than when the equivalent problem is given as text, revealing a pronounced modality gap. These findings matter for understanding the capability boundaries of current VLMs and for guiding future model improvements.
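To make the controlled comparison concrete, the sketch below shows one plausible way to measure such a text-versus-image gap: query a model on the text version and on the rendered-image version of each problem and take the difference in accuracy. The data class and function names (`CrossMathItem`, `ask_text`, `ask_image`, `modal_gap`) are illustrative placeholders, not the benchmark's actual API.

```python
# Minimal sketch of a paired cross-modal evaluation, assuming each benchmark
# item exists in two equivalent forms: a text prompt and a rendered image.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CrossMathItem:
    """One problem expressed in two equivalent modalities (hypothetical schema)."""
    problem_id: str
    text_prompt: str   # the problem stated purely in text
    image_path: str    # the same problem rendered as an image
    answer: str        # gold answer shared by both variants


def accuracy(predictions: List[str], golds: List[str]) -> float:
    """Exact-match accuracy after trimming whitespace."""
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return correct / max(len(golds), 1)


def modal_gap(
    items: List[CrossMathItem],
    ask_text: Callable[[str], str],        # VLM queried with text only
    ask_image: Callable[[str, str], str],  # VLM queried with image + instruction
) -> float:
    """Accuracy on text inputs minus accuracy on equivalent visual inputs."""
    golds = [it.answer for it in items]
    text_preds = [ask_text(it.text_prompt) for it in items]
    image_preds = [
        ask_image(it.image_path, "Solve the problem shown in the image.")
        for it in items
    ]
    return accuracy(text_preds, golds) - accuracy(image_preds, golds)
```

A positive `modal_gap` under this setup would correspond to the pattern the study reports: the same model reasons less reliably when the identical problem arrives through the visual channel.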