# Spatial Reasoning Reinforcement Learning Without Labeled Data: Consistency Verifier Unleashes the Potential of Large Models

> Researchers propose a self-supervised reinforcement learning framework that aligns the spatial reasoning capabilities of large language models via a consistency verifier. This method requires no labeled data, uses image and text transformations as reward signals, and achieves performance close to supervised training on multiple tasks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-10T10:50:06.000Z
- Last activity: 2026-06-11T04:22:35.113Z
- Popularity: 133.5
- Keywords: spatial reasoning, reinforcement learning, self-supervised learning, large language models, consistency verification, optimal transport, GRPO, machine learning
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2606-11918v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2606-11918v1

---

## Introduction

The proposed framework aligns the spatial reasoning capabilities of large language models without any labeled data: consistency under image and text transformations serves as the reward signal, and the resulting performance is close to supervised training on multiple tasks. It has two key innovations: a consistency verifier, which checks geometric and semantic consistency under transformations, and OT-GRPO, a policy optimization strategy driven by optimal transport. Together they offer new insights for spatial reasoning and self-supervised learning.

## Background: Spatial Reasoning – The Achilles' Heel of Large Models

Current large reasoning models (LRMs) perform poorly on spatial reasoning tasks, even though they are strong at tasks such as poetry writing and programming. The traditional view attributes this gap to a knowledge deficit and tries to close it with supervised fine-tuning (SFT) on additional spatial data. This study takes a different view: the models already possess the relevant capabilities, but those capabilities have not been activated and aligned correctly. What is needed is alignment through logically consistent geometric constraints.

## Method: Consistency Verifier and OT-GRPO Strategy

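**Consistency Verifier**: This component serves as a self-supervised reward function. It checks whether the model's reasoning results stay consistent when the input is transformed, using image transformations (horizontal/vertical flipping, rotation) and text transformations (swapping object order, reversing relationships). For example, if the model answers that an object is on the left in the original image, it should answer that the object is on the right after a horizontal flip.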
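The idea of a consistency reward can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the relation vocabulary, the transformation names, the `TRANSFORMS` table, and the binary 0/1 reward in `consistency_reward` are all hypothetical choices made for the example.

```python
# Hypothetical sketch of a consistency-based reward; the names and the
# binary reward are assumptions, not taken from the paper.

# For each transformation, map the answer on the original input to the
# answer that would be consistent on the transformed input.
TRANSFORMS = {
    # Image transformations
    "flip_horizontal": {"left of": "right of", "right of": "left of",
                        "above": "above", "below": "below"},
    "flip_vertical": {"left of": "left of", "right of": "right of",
                      "above": "below", "below": "above"},
    "rotate_180": {"left of": "right of", "right of": "left of",
                   "above": "below", "below": "above"},
    # Text transformation: asking about (B, A) instead of (A, B)
    "swap_objects": {"left of": "right of", "right of": "left of",
                     "above": "below", "below": "above"},
}


def consistency_reward(original_answer, transformed_answer, transform):
    """Return 1.0 if the two answers agree under `transform`, else 0.0.

    No ground-truth label is used: only the pair of model answers.
    """
    expected = TRANSFORMS[transform].get(original_answer)
    if expected is None:  # answer outside the assumed vocabulary
        return 0.0
    return 1.0 if expected == transformed_answer else 0.0


# The model said "left of" on the original image and "right of" on the
# horizontally flipped image, so the pair is consistent.
reward = consistency_reward("left of", "right of", "flip_horizontal")
```

Note that a consistent pair is not necessarily a correct one: a model that is wrong on both inputs in a matching way would still be rewarded, which is why the choice of transformations matters.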

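**OT-GRPO Strategy**: Verification signals come in pairs (a result on the original input and a result on the transformed input), which makes them inefficient to use directly. To address this, the method introduces optimal transport theory, which captures the pairing structure by minimizing matching costs. The procedure has four steps: generate candidate responses, run reasoning on both the original and the transformed inputs, pair the results via optimal transport, and update the policy based on the resulting feedback.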
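The pairing and feedback steps can be illustrated with a small, self-contained sketch. The source does not specify the cost function, the solver, or how the transport plan becomes a reward, so everything below is an assumption: a 0/1 cost based on a horizontal-flip mapping, an entropy-regularized Sinkhorn solver with uniform marginals, a reward equal to the share of each response's mass sent to consistent partners, and the group-standardized advantage commonly used in GRPO.

```python
import math

# Assumed mapping: the consistent answer after a horizontal image flip.
FLIP_HORIZONTAL = {"left of": "right of", "right of": "left of",
                   "above": "above", "below": "below"}


def cost_matrix(original_answers, transformed_answers):
    """Cost 0.0 for a consistent pair of answers, 1.0 otherwise (assumed)."""
    return [[0.0 if FLIP_HORIZONTAL.get(a) == b else 1.0
             for b in transformed_answers]
            for a in original_answers]


def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized optimal transport between uniform marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_rewards(cost, plan):
    """Approximate share of each response's mass sent to consistent partners."""
    n = len(cost)
    return [n * sum(p * (1.0 - c) for p, c in zip(plan[i], cost[i]))
            for i in range(n)]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers on the original image and four on the flipped image.
original = ["left of", "left of", "right of", "above"]
transformed = ["right of", "right of", "right of", "below"]

cost = cost_matrix(original, transformed)
plan = sinkhorn(cost)
rewards = transport_rewards(cost, plan)
advantages = group_relative_advantages(rewards)
```

In this toy group, the two "left of" responses can be matched to consistent partners and receive positive advantages, while the other two have no consistent partner and receive negative ones. In a real training loop these advantages would weight the policy-gradient update; that step is omitted here.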

## Experimental Evidence: Performance Close to Supervised Learning and Generalization Ability

Experimental results show that consistency training with no labels at all achieves accuracy close to that of models trained with supervision. The model performs well across several types of spatial reasoning tasks (2D relationships, 3D understanding, compositional reasoning) and generalizes well across data domains (synthetic and real images, simple and complex scenes), which suggests that it has learned general spatial reasoning principles rather than task-specific patterns.

## Conclusion: Reconsidering the AI Learning Paradigm

This study challenges three traditional assumptions:

1. **Data Efficiency**: Self-supervised signals can replace expensive annotations, which makes the approach suitable for data-scarce domains.
2. **Capability Alignment**: There is no need to inject new knowledge; existing capabilities need to be activated.
3. **Value of Consistency**: Consistency constraints can be extended to multiple domains (geometry, logic, semantics), opening up directions for new algorithm design.

## Practical Implications: Multi-faceted Applications from Training to Diagnosis

1. **New Perspective on Data Augmentation**: Data augmentation can serve as a source of consistency verification (design transformations that preserve attributes).
2. **Model Diagnosis Tool**: Identify weak points through consistency checks under transformations.
3. **Multimodal Framework**: Combining image and text transformations provides self-supervised signals for vision-language models.
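The diagnostic use mentioned in the list above can be sketched as a per-transformation consistency rate computed over logged model answers. This is only an illustration: the `EXPECTED` table, the record format, and the `consistency_report` helper are hypothetical and not described in the source.

```python
from collections import defaultdict

# Assumed mappings from the original answer to the consistent answer
# on the transformed input.
EXPECTED = {
    "flip_horizontal": {"left of": "right of", "right of": "left of",
                        "above": "above", "below": "below"},
    "flip_vertical": {"left of": "left of", "right of": "right of",
                      "above": "below", "below": "above"},
}


def consistency_report(records):
    """Return the share of consistent answer pairs for each transformation.

    `records` is an iterable of
    (transform, original_answer, transformed_answer) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for transform, original_answer, transformed_answer in records:
        totals[transform] += 1
        if EXPECTED[transform].get(original_answer) == transformed_answer:
            hits[transform] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Hypothetical logged answers from a model under evaluation.
records = [
    ("flip_horizontal", "left of", "right of"),
    ("flip_horizontal", "above", "above"),
    ("flip_vertical", "above", "above"),   # inconsistent: expected "below"
    ("flip_vertical", "below", "above"),
]
report = consistency_report(records)
weakest = min(report, key=report.get)
```

A low rate for one transformation (here the vertical flip) points to the kind of spatial relation the model handles least reliably, without requiring any ground-truth labels.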

## Limitations and Future Research Directions

The current limitations, and the research directions they suggest, are:

1. **Dependency on Transformation Design**: The effectiveness of the verifier depends on how the transformations are designed. Automatically learning optimal transformations remains to be explored.
2. **Extension to Complex Scenarios**: Consistency verification for dynamic environments and non-rigid objects is still unaddressed.
3. **Combination with Semi-Supervised Learning**: The best way to combine a small amount of supervised data with self-supervised methods is an open question.
