Section 01
【Introduction】Label-Free Reinforcement Learning for Spatial Reasoning: A Consistency Verifier Unlocks the Potential of Large Models
Researchers propose a self-supervised reinforcement learning framework that uses a consistency verifier to align the spatial reasoning capabilities of large language models. The method requires no labeled data: it derives its reward signals from consistency under image and text transformations, and it achieves performance close to supervised training on multiple tasks. The two key innovations are the consistency verifier, which checks geometric and semantic consistency under transformations, and OT-GRPO, an optimal transport-driven policy optimization strategy. Together they offer a new direction for spatial reasoning and self-supervised learning.
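To make the idea of a label-free reward concrete, here is a minimal sketch of a consistency check under one transformation, a horizontal flip of the image. It is an illustration under stated assumptions, not the researchers' verifier: the names `FLIP_MAP`, `expected_after_flip`, and `consistency_reward` are hypothetical, and the rule that a flip swaps "left" and "right" while leaving "above" and "below" unchanged is the only geometric fact it relies on.

```python
# Hypothetical sketch: all names and the single flip rule are illustrative
# assumptions, not the verifier described in the source.

# How a horizontal flip of the image should change a spatial answer.
FLIP_MAP = {"left": "right", "right": "left", "above": "above", "below": "below"}


def expected_after_flip(answer: str) -> str:
    """Map an answer given on the original image to the answer expected on its mirror image."""
    key = answer.strip().lower()
    # Answers outside the map are assumed to be unaffected by the flip.
    return FLIP_MAP.get(key, key)


def consistency_reward(answer_original: str, answer_flipped: str) -> float:
    """Return 1.0 if the two answers agree under the flip, else 0.0.

    No ground-truth label is consulted: the reward depends only on whether
    the model's two answers are consistent with each other.
    """
    predicted = expected_after_flip(answer_original)
    return 1.0 if predicted == answer_flipped.strip().lower() else 0.0
```

For example, `consistency_reward("left", "right")` yields a reward of 1.0, while `consistency_reward("left", "left")` yields 0.0. Note that a check of this kind rewards agreement rather than correctness, so a model that is wrong in a consistent way would still score well on this single transformation.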