# How to Teach AI to "Think Visually"? A New Breakthrough in Cross-View Spatial Reasoning

> A research team proposes the View Drop (VDrop) training method and a panoramic visual thinking strategy. Together they address key challenges that vision-language models face in cross-view spatial reasoning and achieve state-of-the-art out-of-domain generalization.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-26T17:20:05.000Z
- Last activity: 2026-05-27T04:54:06.047Z
- Popularity: 137.4
- Keywords: vision-language models, spatial reasoning, visual thinking, unified multimodal models, cross-view reasoning, view drop, panoramic rendering
- Page link: https://www.zingnex.cn/en/forum/thread/ai-23f6f525
- Canonical: https://www.zingnex.cn/forum/thread/ai-23f6f525
- Markdown source: floors_fallback

---

## [Introduction] How to Teach AI to "Think Visually"? A New Breakthrough in Cross-View Spatial Reasoning

The research team proposes the View Drop (VDrop) training method and a panoramic visual thinking strategy. They target a key weakness of vision-language models (VLMs) in cross-view spatial reasoning: the models lean on language and lose fine-grained geometric information. The combined approach achieves the best out-of-domain generalization performance.

Source: Paper published on arXiv on May 26, 2026, titled "How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning" (link: http://arxiv.org/abs/2605.27310v1)

## Problem Background: Dilemma in Cross-View Spatial Reasoning

Vision-language models (VLMs) perform well on many tasks but have clear shortcomings in cross-view spatial reasoning, that is, understanding how different views of the same scene correspond to one another. Examples include judging whether two room photos show the same space, or inferring where an object seen in one view appears in another. Current models reason mainly in language, which discards the fine-grained geometric information the task requires, so they struggle to capture complex 3D spatial relationships.

## Challenges of Visual Thinking and Advantages of the UMM Architecture

Researchers have proposed "visual thinking", in which a model generates intermediate thinking images to assist its reasoning. In practice, however, models often ignore the visual evidence in those thinking images. Unified Multimodal Models (UMMs) natively support interleaved image-text generation without switching between modules, which makes them a more natural foundation for visual thinking.

## VDrop Training Method: Forcing Models to Utilize Visual Thinking

View Drop (VDrop) is a training-time intervention. Its core idea is to keep all input views visible while the model generates thinking images, then randomly hide some input views while it generates the final answer. This forces the model to rely on the thinking images to recover the hidden information. The training steps are:
1. The model receives multi-view input images.
2. All views are visible while it generates the thinking images.
3. Some views are hidden while it generates the answer.
4. The model infers the hidden information from the thinking images.
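
The four steps above can be sketched as a small data-preparation routine. This is a minimal illustration, not the paper's implementation: the function names `sample_view_mask` and `build_vdrop_contexts`, the `drop_prob` parameter, and the choice to hide each view independently are all assumptions, and views are represented by placeholder file names.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_view_mask(num_views: int, drop_prob: float,
                     rng: Optional[random.Random] = None) -> List[bool]:
    """Sample which input views stay visible at answer time (True = visible).

    Assumption: each view is hidden independently with probability drop_prob.
    """
    rng = rng or random.Random()
    return [rng.random() >= drop_prob for _ in range(num_views)]


def build_vdrop_contexts(views: Sequence[str], thinking_image: str,
                         drop_prob: float = 0.5,
                         rng: Optional[random.Random] = None
                         ) -> Tuple[List[str], List[str]]:
    """Return (thinking_context, answer_context) for one training example."""
    # Steps 1-2: every input view is visible while the thinking image is generated.
    thinking_context = list(views)
    # Step 3: hide a random subset of the views for the answer stage.
    mask = sample_view_mask(len(views), drop_prob, rng)
    kept_views = [v for v, visible in zip(views, mask) if visible]
    # Step 4: the thinking image always stays visible, so it is the only
    # remaining route to information from the hidden views.
    answer_context = kept_views + [thinking_image]
    return thinking_context, answer_context


if __name__ == "__main__":
    views = ["view_0.png", "view_1.png", "view_2.png"]  # hypothetical inputs
    thinking, answer = build_vdrop_contexts(
        views, "thinking.png", drop_prob=0.5, rng=random.Random(0))
    print("thinking stage sees:", thinking)
    print("answer stage sees:  ", answer)
```

In a real training loop the two contexts would determine which image tokens the model can attend to at each stage; how that masking is applied inside the model is not described in this summary.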

## Choice of Thinking Images: Trade-off Between Learnability and Informativeness

The research team compared three thinking image variants:
1. Bird's-eye rendering: rich in spatial information, but abstract and hard to relate to the input views.
2. Panoramic rendering: a 360-degree panorama that preserves the complete visual context, balancing spatial information and visual continuity.
3. Point-matching rendering: concrete but sparse, which makes it hard to support complex reasoning.
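
To see why a panorama can keep the complete visual context, it helps to look at how a single image can index every viewing direction. The sketch below assumes an equirectangular layout, which is a common panorama format but is not confirmed by this summary as the one used in the paper; the function name `direction_to_equirect` and the axis convention are likewise illustrative assumptions.

```python
import math
from typing import Tuple


def direction_to_equirect(x: float, y: float, z: float,
                          width: int, height: int) -> Tuple[float, float]:
    """Map a 3D viewing direction to (u, v) pixel coordinates on a panorama.

    Assumed convention: +z is forward, +x is right, +y is up, and the
    forward direction lands at the center of the image.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    # Yaw (longitude) in [-pi, pi], measured from the forward axis.
    lon = math.atan2(x, z)
    # Pitch (latitude) in [-pi/2, pi/2]; clamp to guard against rounding.
    lat = math.asin(max(-1.0, min(1.0, y / norm)))
    # Longitude spans the full width, latitude spans the full height.
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u % width, v


if __name__ == "__main__":
    # Forward, right, and straight up on a 1024x512 panorama.
    for d in [(0, 0, 1), (1, 0, 0), (0, 1, 0)]:
        print(d, direction_to_equirect(*d, width=1024, height=512))
```

Every direction around the camera maps to some pixel, so content from any input view has a place in the panorama. A bird's-eye map, by contrast, changes the viewpoint entirely, which is consistent with the article's remark that it is harder to relate to the input views.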

## Experimental Results: Superiority of Panoramic Visual Thinking

The models were trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks. Panoramic visual thinking with VDrop is the only configuration that balances informativeness and learnability, and it achieves the best out-of-domain generalization, performing well even on unseen real scenes. Bird's-eye rendering is highly informative but hard to learn, while point-matching rendering is learnable but lacks informativeness.

## Research Implications and Future Directions

**Implications**: Visual thinking can improve spatial reasoning; training interventions such as VDrop can guide model behavior; representations need to balance learnability and informativeness; and out-of-domain generalization matters for practical applications.

**Limitations**: The approach relies on synthetic data, has a high computational cost, and is strongly task-specific.

**Future Directions**: Explore other thinking image representations; extend VDrop to other tasks; train on real data; combine visual and language thinking.
