"Mental Imagery" of Multimodal Models: Does AI Really "Imagine" in Its Mind?

Studies have found that multimodal models form internal representations similar to human mental imagery when solving spatial puzzles. Integrating visual tokens into the chain of thought raises reasoning accuracy from 83% to 89%.

Tags: multimodal models · mental imagery · spatial reasoning · chain of thought · visual representations · Qwen3.5
Published 2026-05-11 02:25 · Recent activity 2026-05-12 13:24 · Estimated read 7 min
"Mental Imagery" of Multimodal Models: Does AI Really "Imagine" in Its Mind?
1

Section 01

Introduction: Core Insights of the "Mental Imagery" Study on Multimodal Models

Studies have found that large multimodal models form internal visual representations similar to human mental imagery when solving spatial puzzles. Integrating visual tokens into the chain of thought raises reasoning accuracy from 83% to 89%. This finding speaks to the philosophical question of whether AI has human-like inner experiences, and it offers a new lever both for improving model reasoning and for understanding AI cognition.


Section 02

Background: Philosophical Inquiry into AI Cognition and the Origin of the Study

The monologue of Roy, the replicant in Blade Runner, raises a profound question: do non-human intelligent agents have inner experiences like ours? Recent findings in AI research offer a partial answer: large multimodal models do form internal representations resembling "mental imagery". When they solve spatial puzzles, their neural network activations encode meaningful visual information; in that sense, the AI is "imagining".


Section 03

Experimental Methods: Twelve Visual Reasoning Tasks and Open-Loop Supervision Design

The research team selected twelve visual reasoning tasks to test the spatial reasoning of multimodal models, spanning classic puzzles (tangram, jigsaw, Sokoban) and spatial transformations (3D mental rotation, Hua Rong Dao, also known as Klotski). All of these tasks require understanding geometric relationships, spatial layouts, and the consequences of actions. The model under test was Qwen3.5 VLM, evaluated with open-loop supervision: the model predicts the entire action sequence without observing the rendered result of each step.
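The open-loop setup is easiest to see as a harness: the model receives only the initial image, emits a whole action plan, and the plan is scored by replaying it in a simulator. Below is a minimal sketch of that protocol; the `Puzzle` fields and the `model.generate` interface are illustrative assumptions, since the article does not describe the actual evaluation code.

```python
# Minimal sketch of an open-loop evaluation harness (hypothetical
# interfaces; not the paper's actual code).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Puzzle:
    initial_image: bytes                      # rendered starting state
    initial_state: Any
    apply: Callable[[Any, str], Any]          # apply(state, action) -> next state
    is_solved: Callable[[Any], bool]

def open_loop_solve(model, puzzle: Puzzle, max_steps: int = 32) -> bool:
    """Open-loop protocol: the model emits a full action plan from the
    initial image alone and never sees intermediate rendered frames."""
    prompt = "Solve the puzzle. Output one action per line."
    plan = model.generate(image=puzzle.initial_image, prompt=prompt)
    state = puzzle.initial_state
    for action in plan.splitlines()[:max_steps]:
        # Actions are replayed only in the simulator; no feedback frame
        # is ever returned to the model.
        state = puzzle.apply(state, action.strip())
    return puzzle.is_solved(state)
```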


Section 04

Core Evidence: Visual Encoding in Model Activations and Formation of World Models

Analyzing Qwen3.5 VLM's activation patterns after each predicted action, the study found that they encode meaningful visual information about the intermediate state. Even without being explicitly trained to "imagine" intermediate states, the network naturally forms internal representations of the current state while predicting actions, much like the visual images humans use when planning. This suggests that an imperfect visual world model emerges as a byproduct of learning, without explicit visual supervision, in much the way human children build internal models of the physical world.
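One standard way to show that activations "encode meaningful visual information" is a linear probe: train a simple classifier from hidden states to the ground-truth intermediate state and check held-out accuracy. The sketch below illustrates the idea with scikit-learn; the array shapes, labels, and activation-capture step are assumptions, as the article does not specify the probing setup.

```python
# Linear-probe sketch: can a simple classifier recover the intermediate
# puzzle state from the model's post-action hidden states?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_activations(hidden_states: np.ndarray,
                      state_labels: np.ndarray) -> float:
    """hidden_states: (n_samples, d_model) activations captured after the
    model predicts an action; state_labels: discretized ground-truth
    intermediate states. Held-out accuracy well above chance suggests
    the activations linearly encode the 'imagined' state."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, state_labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)  # accuracy near chance => no decodable info
```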


Section 05

Technical Breakthrough: Integrating Visual Tokens into Chain of Thought Improves Reasoning Accuracy

Building on this finding, the research team proposed injecting visual tokens into the chain of thought: at each reasoning step, 16 internally generated visual tokens are interleaved with the text tokens. The change brought a clear gain: the average solve rate rose from 83% to 89%, with the largest improvements on reasoning-intensive tasks such as jigsaw puzzles and 3D mental rotation. The likely mechanism is that the model explicitly uses its internal visual representations to support spatial reasoning, much as humans sketch diagrams.
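Mechanically, the method amounts to alternating short bursts of model-generated visual tokens with the text tokens of the chain of thought. The sketch below shows that interleaving pattern; `generate_text_step`, `generate_visual_tokens`, and `decode_answer` are hypothetical names for illustration, and only the figure of 16 visual tokens per step comes from the article.

```python
# Sketch of interleaving latent visual tokens into the chain of thought.
# All model methods here are illustrative assumptions; token sequences
# are modeled as plain lists of token ids.

VISUAL_TOKENS_PER_STEP = 16  # figure quoted in the article

def reason_with_visual_tokens(model, image, question, num_steps: int = 8):
    """Alternate one textual reasoning step with a burst of visual tokens
    that re-ground the chain of thought in an internal 'image' of the
    current state."""
    sequence = model.encode(image=image, text=question)
    for _ in range(num_steps):
        # 1) ordinary text reasoning tokens for this step
        sequence += model.generate_text_step(sequence)
        # 2) latent visual tokens: the model's own rendering of the
        #    intermediate state, fed back into its context
        sequence += model.generate_visual_tokens(
            sequence, n=VISUAL_TOKENS_PER_STEP)
    return model.decode_answer(sequence)
```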


Section 06

Significance Discussion: Dual Implications for Philosophy and Technology

This study has dual significance for philosophy and technology:

  • Philosophical implications: mental imagery emerges naturally as a byproduct of learning, showing that complex cognitive abilities can arise from powerful learning and optimization; the internal visual representations are useful information structures rather than noise; and the fact that AI converges on a human-like cognitive strategy hints at a common approach intelligent systems use to solve spatial problems.
  • Technical implications: it opens a new direction for improving multimodal reasoning (exploiting internal visual representations); the gain costs only 16 extra tokens per reasoning step; and analyzing internal activations offers a new tool for AI interpretability.

Section 07

Limitations and Future Directions

Current limitations of the study:

  1. The tasks are confined to spatial reasoning; whether the findings extend to other types of reasoning has not been verified;
  2. The results are based on Qwen3.5 VLM; models of other scales or families may behave differently;
  3. The visual world model is imperfect, and its accuracy and robustness need further study.

Future directions:

  1. Expand task types to explore whether similar internal representations exist in other reasoning tasks;
  2. Develop techniques to visualize the model's "mental imagery";
  3. Design methods to actively guide and optimize the formation of internal representations;
  4. Look for similar internal representations in other modalities, such as hearing and touch.