How to Teach AI to 'Visual Think'? New Breakthrough in Cross-View Spatial Reasoning

The research team proposes the View Drop (VDrop) training method and a panoramic visual thinking strategy. Together they address key challenges that vision-language models face in cross-view spatial reasoning and achieve state-of-the-art out-of-domain generalization.

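Vision-language models, spatial reasoning, visual thinking, unified multimodal models, cross-view reasoning, View Drop, panoramic rendering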
Published 2026-05-27 01:20 | Recent activity 2026-05-27 12:54 | Estimated read 6 min

Section 01

[Introduction] How to Teach AI to 'Visual Think'? New Breakthrough in Cross-View Spatial Reasoning

The research team proposes the View Drop (VDrop) training method and a panoramic visual thinking strategy. Both target a key problem in cross-view spatial reasoning: vision-language models (VLMs) lean on language-based reasoning and lose fine-grained geometric information. The combination achieves the best out-of-domain generalization performance.

Source: Paper published on arXiv on May 26, 2026, titled "How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning" (link: http://arxiv.org/abs/2605.27310v1)

Section 02

Problem Background: Dilemma in Cross-View Spatial Reasoning

Vision-language models (VLMs) perform well on many tasks but show clear weaknesses in cross-view spatial reasoning: understanding how different views of the same scene correspond to one another, for example judging whether two room photos show the same space or inferring where an object appears in another view. Current models rely mainly on language-based reasoning, which discards the fine-grained geometric information the task requires, so they struggle to capture complex 3D spatial relationships.
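
To make "inferring the position of an object in another view" concrete, here is a minimal sketch of the underlying geometry on a 2D floor plan. It is an illustration only, not something taken from the paper: the function name `to_other_view`, the 2D simplification, and the angle convention (camera B's axes rotated counterclockwise by `b_yaw` radians relative to camera A's) are all assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def to_other_view(p_in_a: Point, b_origin_in_a: Point, b_yaw: float) -> Point:
    """Express a floor-plan point given in camera A's frame in camera B's frame.

    Assumed convention (hypothetical, for illustration): b_origin_in_a is
    B's position in A's frame, and b_yaw is the counterclockwise rotation
    (in radians) of B's axes relative to A's axes.
    """
    # Offset of the point from B's origin, still expressed in A's axes.
    dx = p_in_a[0] - b_origin_in_a[0]
    dy = p_in_a[1] - b_origin_in_a[1]
    c, s = math.cos(b_yaw), math.sin(b_yaw)
    # Apply the inverse rotation (transpose of the 2D rotation matrix).
    return (c * dx + s * dy, -s * dx + c * dy)


# An object 2 units along A's x-axis, seen from a camera B that sits
# 1 unit along A's x-axis and is turned 90 degrees counterclockwise.
x_b, y_b = to_other_view((2.0, 0.0), (1.0, 0.0), math.pi / 2)
```

In this example the object ends up at approximately (0, -1) in B's frame. A model that reasons only in words has to recover this kind of numeric relationship implicitly, which is where the fine-grained geometry tends to get lost.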

Section 03

Challenges of Visual Thinking and Advantages of the UMM Architecture

Researchers have proposed "visual thinking", in which the model generates intermediate thinking images to support its reasoning. In practice, however, models often ignore the visual evidence in those thinking images. Unified Multimodal Models (UMMs) natively support interleaved image-text generation without switching between modules, which gives visual thinking a more natural foundation.

Section 04

VDrop Training Method: Forcing Models to Utilize Visual Thinking

View Drop (VDrop) is a training-time intervention. The core idea: keep all input views visible while the model generates the thinking image, then randomly hide some input views while it generates the final answer, so that the model must rely on the thinking image to recover the hidden information. Training steps:

  1. Receive the multi-view input images;
  2. Generate the thinking image with all views visible;
  3. Hide some of the views before generating the answer;
  4. Recover the hidden information from the thinking image.
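
The steps above can be sketched as a small data-preparation routine. This is a hedged illustration, not the authors' implementation: the article does not say how views are hidden (dropping tokens, masking attention, or something else) or how many are hidden, so the function name `vdrop_answer_context`, the per-view `drop_prob` parameter, and the use of strings as stand-ins for images are all assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def vdrop_answer_context(
    views: Sequence[str],
    thinking_image: str,
    drop_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Build the context that is visible while the final answer is generated.

    Hypothetical sketch: each input view is hidden independently with
    probability drop_prob (an assumed scheme). The thinking image, which
    was generated earlier with ALL views visible, is always kept.

    Returns (visible_items, hidden_view_indices).
    """
    if rng is None:
        rng = random.Random()
    # Step 3: decide which input views to hide for the answer stage.
    hidden = [i for i in range(len(views)) if rng.random() < drop_prob]
    hidden_set = set(hidden)
    visible = [v for i, v in enumerate(views) if i not in hidden_set]
    # Step 4: the thinking image stays, so it is the only remaining
    # source of whatever the hidden views contained.
    return visible + [thinking_image], hidden


views = ["view_front", "view_left", "view_back"]  # placeholder image handles
context, hidden = vdrop_answer_context(
    views, "thinking_panorama", drop_prob=0.5, rng=random.Random(0)
)
```

Because the hidden views are unavailable at answer time, a training loss on the answer can only be reduced by reading the thinking image, which is the behavior VDrop is meant to encourage.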

Section 05

Choice of Thinking Images: Trade-off Between Learnability and Informativeness

The research team compared three thinking image variants:

  1. Bird's-eye rendering: rich in spatial information but abstract, which makes it hard to match against the input views;
  2. Panoramic rendering: a 360-degree panorama that preserves the complete visual context, balancing spatial information and visual continuity;
  3. Point matching rendering: concrete but sparse, which makes it hard to support complex reasoning.
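
One way to see why a panorama keeps the full visual context is that every viewing direction maps to exactly one pixel. The sketch below uses an equirectangular layout, a common panorama format; the article does not say which projection the authors use, so the projection, the axis convention (+z forward, +x right, +y up), and the function name `direction_to_panorama_pixel` are assumptions made for illustration.

```python
import math
from typing import Tuple


def direction_to_panorama_pixel(
    x: float, y: float, z: float, width: int, height: int
) -> Tuple[float, float]:
    """Map a 3D viewing direction to (u, v) in an equirectangular panorama.

    Assumed convention: +z is forward (image center), +x is right,
    +y is up. u grows to the right, v grows downward.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    yaw = math.atan2(x, z)        # horizontal angle in [-pi, pi], 0 = forward
    pitch = math.asin(y / norm)   # vertical angle in [-pi/2, pi/2], 0 = horizon
    u = (yaw / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - pitch / math.pi) * height
    return u, v


# Looking straight ahead lands in the center of a 2048 x 1024 panorama;
# looking to the right lands three quarters of the way across.
center = direction_to_panorama_pixel(0.0, 0.0, 1.0, 2048, 1024)
right = direction_to_panorama_pixel(1.0, 0.0, 0.0, 2048, 1024)
```

Under these assumptions, content from every input view has a well-defined place in the panorama, whereas a bird's-eye map discards appearance and a point-matching image keeps only a few correspondences.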

Section 06

Experimental Results: Superiority of Panoramic Visual Thinking

After training on synthetic scenes, the models were evaluated on five real-world out-of-domain benchmarks. Panoramic visual thinking with VDrop is the only configuration that balances informativeness and learnability, and it achieves the best out-of-domain generalization, performing well even on unseen real scenes. Bird's-eye rendering is highly informative but hard to learn, while point matching rendering is learnable but not informative enough.

Section 07

Research Implications and Future Directions

Implications: Visual thinking can improve spatial reasoning ability; training interventions (such as VDrop) can guide model behavior; representations need to balance learnability and informativeness; out-of-domain generalization is important for practical applications.

Limitations: Reliance on synthetic data; high computational cost; strong task specificity.

Future Directions: Explore other thinking image representations; extend VDrop to other tasks; train on real data; combine visual and language thinking.