# Research Findings: Chain of Thought Impairs Visual-Spatial Reasoning Ability of Multimodal Large Models

> The paper evaluates 17 models on 13 spatial benchmarks and finds that Chain of Thought (CoT) prompting reduces visual-spatial reasoning performance rather than improving it. It also shows that the models suffer from serious shortcut learning and visual hallucination.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-17T13:35:45.000Z
- Last activity: 2026-04-20T02:26:52.409Z
- Popularity: 97.2
- Keywords: chain of thought, spatial reasoning, multimodal large models, shortcut learning, visual hallucination, No-Image++, vision-centric reasoning
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-16060v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-16060v1
- Markdown source: floors_fallback

---

## [Introduction] Core Findings: Chain of Thought Impairs Visual-Spatial Reasoning Ability of Multimodal Large Models

Across 17 multimodal models and 13 spatial reasoning benchmarks, the paper finds that Chain of Thought (CoT) prompting lowers visual-spatial reasoning performance instead of raising it, and that the models show serious shortcut learning and visual hallucination. This counterintuitive result challenges the assumption that CoT helps universally in the multimodal domain and suggests directions for future research.

## Background: Application and Problems of Chain of Thought in Multimodal Reasoning

Chain of Thought (CoT) prompting is a major advance for large language models: by making reasoning steps explicit, it significantly improves performance on tasks such as mathematics and logic. Multimodal Reasoning Models (MRMs) extend the technique to the visual domain and have achieved results on tasks such as mathematical chart understanding and geometric problem solving. According to this paper, however, CoT does not help visual-spatial reasoning and in fact impairs it.

## Research Design and Methods: Comprehensive Evaluation of Models and Benchmarks

The research team evaluated 17 multimodal models: open-source models such as LLaVA and Qwen-VL, closed-source models such as GPT-4V and Gemini, and specialized MRMs. The models were tested on 13 spatial reasoning benchmarks covering six task types: spatial relation reasoning, navigation, spatial questions in visual question answering, geometric reasoning, mental rotation, and spatial memory. The team systematically compared performance with and without CoT prompting.
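The comparison between CoT and non-CoT prompting can be illustrated with a small aggregation sketch. It is not the paper's evaluation code: the `Record` fields, the mode labels `"direct"` and `"cot"`, and the toy records are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One graded model answer (hypothetical schema, not from the paper)."""
    benchmark: str
    model: str
    prompt_mode: str  # assumed labels: "direct" (no CoT) or "cot"
    correct: bool


def accuracy_by_mode(records):
    """Return {(benchmark, prompt_mode): accuracy} over all records."""
    totals = {}
    for r in records:
        key = (r.benchmark, r.prompt_mode)
        hits, n = totals.get(key, (0, 0))
        totals[key] = (hits + int(r.correct), n + 1)
    return {key: hits / n for key, (hits, n) in totals.items()}


def cot_delta(records):
    """Return {benchmark: cot accuracy - direct accuracy}.

    A negative value means CoT prompting lowered accuracy on that benchmark.
    Benchmarks missing either mode are skipped.
    """
    acc = accuracy_by_mode(records)
    benchmarks = sorted({b for b, _ in acc})
    return {
        b: acc[(b, "cot")] - acc[(b, "direct")]
        for b in benchmarks
        if (b, "cot") in acc and (b, "direct") in acc
    }


# Toy data for illustration only; these are not results from the paper.
toy_records = [
    Record("toy-benchmark", "toy-model", "direct", True),
    Record("toy-benchmark", "toy-model", "direct", True),
    Record("toy-benchmark", "toy-model", "cot", True),
    Record("toy-benchmark", "toy-model", "cot", False),
]
print(cot_delta(toy_records))
```

Reporting the difference per benchmark, rather than one pooled number, makes it easier to see whether a drop is concentrated in particular task types.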

## Core Findings: CoT Causes Decline in Spatial Reasoning Performance

On almost all spatial reasoning tasks, CoT prompting reduces accuracy by 10-20% on average, with larger drops on tasks that require precise spatial localization. Even specialized MRMs are significantly weaker when they use CoT. The reasons given are the limits of verbal description (precision is lost when continuous space is converted to discrete symbols), distracted attention (the model over-focuses on text and overlooks visual details), and misleading reasoning paths (wrong assumptions are amplified as the chain proceeds).

## No-Image++ Experiment: Revealing Shortcut Learning and Visual Hallucinations

The No-Image++ experiment supplies only the question text, without the image. Models using CoT still produce answers in this setting, which exposes shortcut learning: they rely on textual priors rather than on vision. They also show visual hallucination, describing visual details out of thin air when no image is present, which the paper describes as a byproduct of the model keeping its CoT reasoning coherent.
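The idea of a text-only probe can be sketched as follows. This is a simplified illustration and not the paper's No-Image++ protocol, whose details are not given here: the sample layout (`"image"` and `"question"` keys), the chance-level comparison for multiple-choice questions, and the phrase list used to flag claimed visual evidence are all hypothetical.

```python
# Hypothetical phrases suggesting that a response claims to have seen an image.
VISUAL_CLAIM_PHRASES = ("in the image", "i can see", "as shown", "the picture shows")


def to_text_only(sample: dict) -> dict:
    """Return a copy of the sample with the image removed (original is untouched)."""
    blind = dict(sample)
    blind["image"] = None
    return blind


def above_chance_margin(num_correct: int, num_total: int, num_choices: int) -> float:
    """Text-only accuracy minus the random-guess rate for multiple-choice questions.

    A clearly positive margin hints that answers are recoverable from text priors.
    """
    return num_correct / num_total - 1.0 / num_choices


def claims_visual_evidence(response: str) -> bool:
    """Crude keyword check for responses that describe visual content."""
    text = response.lower()
    return any(phrase in text for phrase in VISUAL_CLAIM_PHRASES)


sample = {"image": "scene.png", "question": "Is the cup left of the plate?"}
print(to_text_only(sample))
print(claims_visual_evidence("In the image, the cup is left of the plate."))
```

A keyword check like this would miss paraphrased hallucinations and could flag legitimate refusals, so a real study would need a more careful judgment of the reasoning text.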

## In-depth Analysis: Fundamental Reasons Why CoT Is Unsuitable for Spatial Reasoning

1. Representation difference: Space is represented as continuous geometry, while language is a discrete symbolic representation, so handling spatial problems through CoT's symbols is a mismatch.
2. Reasoning granularity mismatch: CoT's coarse-grained conceptual reasoning cannot capture the fine-grained geometric calculations that spatial tasks require.
3. Training data bias: Strong correlation between question text and answers reinforces shortcut learning.
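The representation mismatch in the first point can be made concrete with a toy example. The sketch below maps a continuous offset between two objects to one of eight direction words. The eight-word vocabulary, the 45-degree sectors, and the y-up coordinate convention are assumptions chosen for illustration, not something taken from the paper.

```python
import math

# Eight direction words, counter-clockwise from the positive x axis (assumed vocabulary).
DIRECTIONS = (
    "right", "upper right", "above", "upper left",
    "left", "lower left", "below", "lower right",
)


def direction_word(dx: float, dy: float) -> str:
    """Map a continuous offset (dx, dy) to a coarse direction word.

    Assumes a mathematical y-up convention; image coordinates usually point y down.
    """
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    sector = int((angle + 22.5) // 45) % 8  # eight 45-degree sectors centered on each word
    return DIRECTIONS[sector]


# Two offsets roughly 33 degrees apart collapse to the same word.
print(direction_word(1.0, 0.3))
print(direction_word(1.0, -0.3))
```

Once both offsets have been verbalized as the same word, any later reasoning step that works only on the text can no longer tell them apart, which is the kind of precision loss the list describes.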

## Challenges to Existing Methods: MRMs, Evaluation, and Application Risks

1. Questioning MRM design: CoT, the core technique of MRMs, impairs spatial reasoning, so their advantages may come from scale rather than architecture.
2. Insufficient evaluation metrics: High scores may come from shortcuts, so methods for detecting real visual understanding are needed.
3. Application risks: Scenarios such as autonomous driving rely on spatial decisions, and models are prone to errors on out-of-distribution inputs.

## Future Directions: Vision-Centered Reasoning Paradigm

The paper calls for work in four areas:

1. Vision-native reasoning architectures (integrating spatial relation modeling and geometric deep learning).
2. Hybrid reasoning strategies (combining CoT with vision-native methods).
3. Strict evaluation protocols (adversarial examples, out-of-distribution testing).
4. Interpretability research (understanding which information sources models rely on).
