"Mental Imagery" of Multimodal Models: Does AI Really "Imagine" in Its Mind?

Studies have found that multimodal models form internal representations similar to human mental imagery when solving spatial puzzles. Integrating visual tokens into the chain of thought raises reasoning accuracy from 83% to 89%.

Tags: multimodal models · mental imagery · spatial reasoning · chain of thought · visual representations · Qwen3.5
Published 2026-05-11 02:25 · Recent activity 2026-05-12 13:24 · Estimated read 7 min
"Mental Imagery" of Multimodal Models: Does AI Really "Imagine" in Its Mind?
1

Section 01

Introduction: Core Insights of the "Mental Imagery" Study on Multimodal Models

Studies have found that large multimodal models form internal visual representations similar to human mental imagery when solving spatial puzzles. Integrating visual tokens into the chain of thought raises reasoning accuracy from 83% to 89%. This finding speaks to the philosophical question of whether AI has human-like inner experiences, and it offers a new lever both for improving model reasoning and for understanding AI cognition.


Section 02

Background: Philosophical Inquiry into AI Cognition and the Origin of the Study

The monologue of Roy, the replicant in Blade Runner, raises a profound question: do non-human intelligent agents have inner experiences like ours? Recent findings in AI research offer a partial answer: large multimodal models do form internal representations resembling "mental imagery". When they solve spatial puzzles, their neural network activations encode meaningful visual information; in that sense, the AI is "imagining".


Section 03

Experimental Methods: Twelve Visual Reasoning Tasks and Open-Loop Supervision Design

The research team selected twelve visual reasoning tasks to test the spatial reasoning of multimodal models, spanning classic puzzles (tangram, jigsaw, Sokoban) and spatial transformations (3D mental rotation, Hua Rong Dao, also known as Klotski). All of these tasks require understanding geometric relationships, spatial layouts, and the consequences of actions. The model under test was Qwen3.5 VLM, evaluated with open-loop supervision: the model predicts the entire action sequence without observing the rendered result of each step.
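The open-loop setup is easiest to see as a harness: the model receives only the initial image, emits a whole action plan, and the plan is scored by replaying it in a simulator. Below is a minimal sketch of that protocol; the `Puzzle` fields and the `model.generate` interface are illustrative assumptions, since the article does not describe the actual evaluation code.

```python
# Minimal sketch of an open-loop evaluation harness (hypothetical
# interfaces; not the paper's actual code).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Puzzle:
    initial_image: bytes                      # rendered starting state
    initial_state: Any
    apply: Callable[[Any, str], Any]          # apply(state, action) -> next state
    is_solved: Callable[[Any], bool]

def open_loop_solve(model, puzzle: Puzzle, max_steps: int = 32) -> bool:
    """Open-loop protocol: the model emits a full action plan from the
    initial image alone and never sees intermediate rendered frames."""
    prompt = "Solve the puzzle. Output one action per line."
    plan = model.generate(image=puzzle.initial_image, prompt=prompt)
    state = puzzle.initial_state
    for action in plan.splitlines()[:max_steps]:
        # Actions are replayed only in the simulator; no feedback frame
        # is ever returned to the model.
        state = puzzle.apply(state, action.strip())
    return puzzle.is_solved(state)
```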


Section 04

Core Evidence: Visual Encoding in Model Activations and Formation of World Models

Analyzing Qwen3.5 VLM's activation patterns after each predicted action, the study found that they encode meaningful visual information about the intermediate state. Even without being explicitly trained to "imagine" intermediate states, the network naturally forms internal representations of the current state while predicting actions, much like the visual images humans use when planning. This suggests that an imperfect visual world model emerges as a byproduct of learning, without explicit visual supervision, in much the way human children build internal models of the physical world.
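One standard way to show that activations "encode meaningful visual information" is a linear probe: train a simple classifier from hidden states to the ground-truth intermediate state and check held-out accuracy. The sketch below illustrates the idea with scikit-learn; the array shapes, labels, and activation-capture step are assumptions, as the article does not specify the probing setup.

```python
# Linear-probe sketch: can a simple classifier recover the intermediate
# puzzle state from the model's post-action hidden states?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_activations(hidden_states: np.ndarray,
                      state_labels: np.ndarray) -> float:
    """hidden_states: (n_samples, d_model) activations captured after the
    model predicts an action; state_labels: discretized ground-truth
    intermediate states. Held-out accuracy well above chance suggests
    the activations linearly encode the 'imagined' state."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, state_labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)  # accuracy near chance => no decodable info
```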


Section 05

Technical Breakthrough: Integrating Visual Tokens into Chain of Thought Improves Reasoning Accuracy

Building on this finding, the research team proposed injecting visual tokens into the chain of thought: at each reasoning step, 16 internally generated visual tokens are interleaved with the text tokens. The change brought a clear gain: the average solve rate rose from 83% to 89%, with the largest improvements on reasoning-intensive tasks such as jigsaw puzzles and 3D mental rotation. The likely mechanism is that the model explicitly uses its internal visual representations to support spatial reasoning, much as humans sketch diagrams.
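Mechanically, the method amounts to alternating short bursts of model-generated visual tokens with the text tokens of the chain of thought. The sketch below shows that interleaving pattern; `generate_text_step`, `generate_visual_tokens`, and `decode_answer` are hypothetical names for illustration, and only the figure of 16 visual tokens per step comes from the article.

```python
# Sketch of interleaving latent visual tokens into the chain of thought.
# All model methods here are illustrative assumptions; token sequences
# are modeled as plain lists of token ids.

VISUAL_TOKENS_PER_STEP = 16  # figure quoted in the article

def reason_with_visual_tokens(model, image, question, num_steps: int = 8):
    """Alternate one textual reasoning step with a burst of visual tokens
    that re-ground the chain of thought in an internal 'image' of the
    current state."""
    sequence = model.encode(image=image, text=question)
    for _ in range(num_steps):
        # 1) ordinary text reasoning tokens for this step
        sequence += model.generate_text_step(sequence)
        # 2) latent visual tokens: the model's own rendering of the
        #    intermediate state, fed back into its context
        sequence += model.generate_visual_tokens(
            sequence, n=VISUAL_TOKENS_PER_STEP)
    return model.decode_answer(sequence)
```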


Section 06

Significance Discussion: Dual Implications for Philosophy and Technology

This study has dual significance for philosophy and technology:

  • Philosophical implications: mental imagery emerges naturally as a byproduct of learning, showing that complex cognitive abilities can arise from powerful learning and optimization; the internal visual representations are useful information structures rather than noise; and the fact that AI converges on a human-like cognitive strategy hints at a common approach intelligent systems use to solve spatial problems.
  • Technical implications: it opens a new direction for improving multimodal reasoning (exploiting internal visual representations); the gain costs only 16 extra tokens per reasoning step; and analyzing internal activations offers a new tool for AI interpretability.

Section 07

Limitations and Future Directions

Current limitations of the study:

  1. The tasks are confined to spatial reasoning; whether the findings extend to other types of reasoning has not been verified;
  2. The results are based on Qwen3.5 VLM; models of other scales or families may behave differently;
  3. The visual world model is imperfect, and its accuracy and robustness need further study.

Future directions:

  1. Expand task types to explore whether similar internal representations exist in other reasoning tasks;
  2. Develop techniques to visualize the model's "mental imagery";
  3. Design methods to actively guide and optimize the formation of internal representations;
  4. Look for similar internal representations in other modalities, such as hearing and touch.