CrossMath: Do Vision-Language Models Truly Possess Visual Reasoning Capabilities?

A systematic study on the cross-modal reasoning capability gap of vision-language models, which reveals the essential differences between text and visual modalities in reasoning tasks through controlled variable experiments.

Tags: Vision-Language Models · Multimodal Reasoning · Benchmarks · Modality Gap · CrossMath · VLM Evaluation · AI Research
Published 2026-04-20 19:33 · Recent activity 2026-04-20 19:52 · Estimated read 7 min
Section 01

CrossMath: Do Vision-Language Models Truly Possess Visual Reasoning Capabilities?

Core Insights Summary

CrossMath is a new multimodal reasoning benchmark from a team at Nanyang Technological University, Singapore, that systematically studies the cross-modal reasoning gap of Vision-Language Models (VLMs). Through controlled cross-modal comparison experiments, it exposes an essential difference between the text and visual modalities in reasoning tasks: VLMs' reasoning accuracy on visual inputs is significantly lower than on equivalent text inputs, revealing a clear modality gap. This finding matters for understanding the capability boundaries of VLMs and for guiding future model improvements.

Section 02

Research Background: The Myth of Multimodal Reasoning

Vision-Language Models (VLMs) have made remarkable progress in recent years, moving from image-text alignment to complex reasoning, and seemingly "understanding" visual information. Yet a core question remains open: during reasoning, do VLMs rely on the visual information itself, or only on the text clues embedded in images? The answer is crucial for delimiting VLM capabilities: if reasoning is mainly text-based, "visual reasoning" may be an illusion, with visual input merely supplying additional text context.

Section 03

Design Philosophy of the CrossMath Benchmark

The core design concept of CrossMath is a controlled cross-modal comparison. Traditional multimodal benchmarks cannot distinguish whether a model performs true visual reasoning or merely exploits text information extracted from images. CrossMath constructs mathematical reasoning tasks that are semantically equivalent across the text and visual modalities but differ in form, then directly compares model performance under pure-text and visual inputs, eliminating modality confounds.
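The pairing idea can be sketched as a small data structure. This is an illustrative schema, not the benchmark's actual code: the class name `CrossModalItem`, the field names, and the helper `build_prompts` are all assumptions. The key property is that the two prompts share one ground-truth answer and differ only in input modality.

```python
from dataclasses import dataclass

@dataclass
class CrossModalItem:
    """One benchmark item rendered in two semantically equivalent modalities."""
    problem_id: str
    text_form: str   # the problem stated as plain text
    image_path: str  # the same problem rendered as an image
    answer: str      # ground-truth answer shared by both forms

def build_prompts(item: CrossModalItem) -> dict:
    """Build the two prompts whose only difference is the input modality."""
    return {
        "text": {"prompt": item.text_form, "image": None},
        "image": {"prompt": "Solve the problem shown in the image.",
                  "image": item.image_path},
    }

item = CrossModalItem("q1", "Compute 3 * (4 + 5).", "q1.png", "27")
prompts = build_prompts(item)
```

Because both variants carry the same `answer`, any accuracy difference between the two conditions can be attributed to the modality rather than to the task content.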

Section 04

Experimental Design and Methodology

CrossMath uses multiple image style variants to test model robustness:

  • Original Style: standard math-problem images
  • Without Border: borders removed, testing dependence on spatial boundaries
  • With Significant Background: distracting elements such as beige backgrounds
  • Change Font and Color: altered text font and color, testing dependence on specific visual features

By comparing performance across these visual conditions, the bottlenecks of model reasoning can be identified.
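The four variants above could be produced by a small renderer. The following is a minimal sketch using Pillow, assuming simple variant parameters (the variant names, colors, and layout here are illustrative, not the paper's actual rendering pipeline):

```python
from PIL import Image, ImageDraw

# Hypothetical configurations for the four style variants described above.
VARIANTS = {
    "original":   {"bg": "white", "fg": "black", "border": True},
    "no_border":  {"bg": "white", "fg": "black", "border": False},
    "background": {"bg": "beige", "fg": "black", "border": True},
    "font_color": {"bg": "white", "fg": "navy",  "border": True},
}

def render_problem(text, variant, size=(320, 80)):
    """Render one problem string under the given style variant."""
    cfg = VARIANTS[variant]
    img = Image.new("RGB", size, cfg["bg"])
    draw = ImageDraw.Draw(img)
    if cfg["border"]:
        # 1-pixel black frame around the image
        draw.rectangle([0, 0, size[0] - 1, size[1] - 1], outline="black")
    draw.text((10, 30), text, fill=cfg["fg"])  # default bitmap font
    return img

imgs = {v: render_problem("3 * (4 + 5) = ?", v) for v in VARIANTS}
```

Rendering every item under all variants lets the evaluation isolate which surface features (borders, background, font) the model's reading of the problem actually depends on.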
Section 05

Core Finding: Significant Modal Gap Exists

Core conclusion of the study: there is a significant gap between the visual and text modalities in reasoning tasks. VLMs' reasoning accuracy on visual inputs is markedly lower than on equivalent text inputs. Although VLMs are trained on vast numbers of image-text pairs, they have not achieved cross-modal equivalent reasoning, and the visual encoding stage may discard information that is key to reasoning.
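Because every item exists in both modalities, the gap reduces to a simple paired accuracy difference. A minimal sketch (the function names and the sample numbers are illustrative, not results from the paper):

```python
def accuracy(preds, golds):
    """Fraction of predictions matching the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def modality_gap(text_preds, image_preds, golds):
    """Accuracy difference on the same items under text vs. image input.
    A positive value means the model does better on text."""
    return accuracy(text_preds, golds) - accuracy(image_preds, golds)

golds       = ["27", "8", "15", "4"]
text_preds  = ["27", "8", "15", "4"]   # illustrative outputs only
image_preds = ["27", "8", "12", "4"]
gap = modality_gap(text_preds, image_preds, golds)  # 1.0 - 0.75 = 0.25
```

Since both prediction lists are scored against the same `golds`, the resulting gap reflects the modality alone, not a difference in problem difficulty.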

Section 06

Technical Implementation and Open-Source Contributions

CrossMath provides a complete benchmark dataset (uploaded to Hugging Face), an open-source evaluation framework, and inference code. It supports three evaluation modes: pure image (image), hybrid (hybrid), and pure text (text), and it supports loading LoRA adapters for convenient post-fine-tuning evaluation. The code features batch inference, multi-sequence generation (num_return_sequences), and detailed logging, lowering the barrier to reproduction.
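The three evaluation modes amount to a switch over how the model input is assembled. Here is a hedged sketch of that dispatch; the function name and input format are assumptions for illustration, and the actual framework's interface may differ:

```python
def build_model_input(mode, question_text, image_path=None):
    """Assemble the model input for the three evaluation modes.

    mode: "text"   -> question text only, no image
          "image"  -> image only, with a generic instruction
          "hybrid" -> question text plus the image
    """
    if mode == "text":
        return {"text": question_text, "images": []}
    if mode == "image":
        return {"text": "Answer the question shown in the image.",
                "images": [image_path]}
    if mode == "hybrid":
        return {"text": question_text, "images": [image_path]}
    raise ValueError(f"unknown mode: {mode}")

inp = build_model_input("hybrid", "Compute 3 * (4 + 5).", "q1.png")
```

Running the same item through all three modes (with the same decoding settings, e.g. the same `num_return_sequences`) is what makes the per-modality scores directly comparable.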

Section 07

Implications and Recommendations for VLM Development

  1. Cognitive Implication: do not overinterpret the "visual understanding" ability of VLMs; they lean heavily on text clues.
  2. Improvement Directions: better visual encoders (preserving details key to reasoning), stronger cross-modal alignment mechanisms (semantically equivalent representations), and dedicated training strategies (strengthening the extraction of visual reasoning clues).
  3. Evaluation Dimension: future evaluations should track cross-modal consistency; a truly capable VLM should perform comparably under text and visual inputs.

CrossMath thus provides an important epistemological tool for multimodal AI research, helping to map the capability boundaries of models and guiding the development of more reliable and general AI systems.
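The cross-modal consistency called for in point 3 can be measured directly: the fraction of items on which the model gives the same answer under both modalities, regardless of correctness. A minimal sketch (metric name and sample values are illustrative, not from the paper):

```python
def cross_modal_consistency(text_preds, image_preds):
    """Fraction of items answered identically under text and image inputs.

    This is deliberately independent of correctness: a model can be
    consistently wrong, which is still informative about the modality gap.
    """
    assert len(text_preds) == len(image_preds)
    same = sum(t == i for t, i in zip(text_preds, image_preds))
    return same / len(text_preds)

rate = cross_modal_consistency(["27", "8", "15"], ["27", "9", "15"])  # 2/3
```

Reporting this rate alongside per-modality accuracy would distinguish a model that fails visual items randomly from one that systematically diverges between modalities.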