SpatialLadder: A Three-Stage Progressive Training Framework for Spatial Reasoning in Vision-Language Models

The SpatialLadder framework proposed by the REAL Lab at Zhejiang University enables a 3B-parameter vision-language model (VLM) to outperform GPT-4o and Gemini-2.0-Flash on spatial reasoning tasks through a three-stage progressive training strategy. The paper has been accepted by ICLR 2026.

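Vision-Language Models · Spatial Reasoning · Progressive Training · Multimodal Learning · Reinforcement Learning · ICLR 2026 · Zhejiang University · Open-Source Models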
Published 2026-06-09 15:34 · Recent activity 2026-06-09 15:49 · Estimated read 6 min

Section 01

[Introduction] SpatialLadder: A Spatial Reasoning Training Framework That Lets a Small Model Outperform Large Ones

The REAL Lab at Zhejiang University proposes SpatialLadder, a three-stage progressive training framework. By training in the order perception → understanding → reasoning, it enables a 3B-parameter vision-language model (VLM) to outperform GPT-4o and Gemini-2.0-Flash on spatial reasoning tasks. The paper has been accepted by ICLR 2026. The project has open-sourced the code, the paper, the pre-trained model, the dedicated SpatialLadder-26k dataset, and the SPBench benchmark.

Section 02

Research Background: Spatial Reasoning Bottlenecks in VLMs and Shortcomings of Existing Methods

Vision-language models have made significant progress on tasks such as image understanding and question answering, but their spatial reasoning (for example, judging the relative positions of objects, integrating multiple views, or tracking trajectories in video) remains weak. Existing methods train complex spatial reasoning directly and skip the perceptual foundation it depends on, so the resulting capability rests on an unstable base.

Section 03

SpatialLadder Framework: Three-Stage Progressive Training Strategy

The framework follows the principle of progressive learning in cognitive science and is divided into three stages:

  1. Spatial Perception Stage: Establish object-position mapping through object detection/localization tasks to solidify the foundation;
  2. Spatial Understanding Stage: Train single-image/multi-view/video spatial reasoning capabilities using the SpatialLadder-26k dataset;
  3. Complex Reasoning Stage: Introduce reinforcement learning with verifiable rewards to enhance multi-step reasoning and spatial imagination abilities.
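
The ordering of the three stages can be sketched as a small curriculum loop. This is only an illustration, not the authors' training code: the `Stage` class, the task-type labels, the `"sft"` objective for the first two stages, the choice of data for the third stage, and the `train_step` callback are all assumptions made for the example. The article itself only states that the third stage uses reinforcement learning with verifiable rewards.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str               # "sft" or "rl" (labels assumed for this sketch)
    task_types: Tuple[str, ...]  # task categories drawn on in this stage


# Hypothetical plan mirroring perception -> understanding -> reasoning.
STAGES: List[Stage] = [
    Stage("perception", "sft", ("object_localization",)),
    Stage("understanding", "sft", ("single_image", "multi_view", "video")),
    Stage("reasoning", "rl", ("single_image", "multi_view", "video")),
]


def select_samples(samples: List[Dict], stage: Stage) -> List[Dict]:
    """Keep only the samples whose task category belongs to this stage."""
    return [s for s in samples if s["task"] in stage.task_types]


def run_curriculum(
    samples: List[Dict],
    train_step: Callable[[Stage, List[Dict]], None],
) -> List[Tuple[str, str, int]]:
    """Run the stages strictly in order; each stage starts from the
    model state left by the previous one (handled inside train_step)."""
    log = []
    for stage in STAGES:
        subset = select_samples(samples, stage)
        train_step(stage, subset)
        log.append((stage.name, stage.objective, len(subset)))
    return log
```

The point of the sketch is the strict ordering: a later stage never runs before the earlier one has finished, which is what distinguishes a progressive schedule from training on all tasks at once.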

Section 04

Dataset Support: Features of SpatialLadder-26k

The SpatialLadder-26k dataset built by the research team contains 26,610 annotated samples across four task categories: object localization, single-image reasoning, multi-view reasoning, and video reasoning. The team describes the annotations as consistent and accurate across a variety of scenarios, and the dataset has been open-sourced on Hugging Face.

Section 05

Experimental Results: 3B Model Outperforms Commercial Large Models

On spatial reasoning benchmarks, SpatialLadder-3B reports the following results:

  • An average improvement of 23.4% over the base model;
  • Outperforms GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%;
  • A 7.2% improvement on out-of-domain benchmarks, which indicates that the gains generalize beyond the training distribution.
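
The summary above does not say whether these percentages are absolute differences in benchmark score (percentage points) or relative gains over the baseline. The two readings can differ a lot, so the short sketch below shows both calculations. The function names and the scores 40.0 and 50.0 are placeholders chosen for the arithmetic and are not results from the paper.

```python
def absolute_gain(new_score: float, base_score: float) -> float:
    """Difference in score, expressed in percentage points."""
    return new_score - base_score


def relative_gain(new_score: float, base_score: float) -> float:
    """Improvement as a percentage of the baseline score."""
    if base_score == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new_score - base_score) / base_score * 100.0


# Placeholder scores, NOT taken from the paper.
base, new = 40.0, 50.0
points = absolute_gain(new, base)    # gain in percentage points
percent = relative_gain(new, base)   # gain relative to the baseline
```

With these placeholder scores the same improvement reads as 10 points under one convention and 25% under the other, so it is worth checking the paper's tables before quoting the figures.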

Section 06

Technical Highlights: Three Key Innovations

  1. Progressive Training Paradigm: Instead of training complex reasoning end to end in a single pass, it builds spatial intelligence layer by layer;
  2. Reinforcement Learning with Verifiable Rewards: Because answers to spatial reasoning questions can be checked automatically, the reward signal needs no learned judge, which improves training efficiency and stability;
  3. High-Quality Dedicated Dataset: A standardized construction process keeps the data systematic and consistent.
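
To make the idea of a verifiable reward concrete, here is a minimal sketch of a rule-based reward function. It is not the reward used in the paper: the `<answer>...</answer>` output format, the function names, and the 10% relative tolerance for numeric answers are assumptions made for illustration.

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> pair, or None.
    The tag format is an assumption of this sketch."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def choice_reward(response: str, gold: str) -> float:
    """1.0 if the extracted option letter matches the gold label, else 0.0."""
    pred = extract_answer(response)
    if pred is None:
        return 0.0
    return 1.0 if pred.upper() == gold.upper() else 0.0


def numeric_reward(response: str, gold: float, tolerance: float = 0.1) -> float:
    """1.0 if the extracted number is within a relative tolerance of gold."""
    pred = extract_answer(response)
    try:
        value = float(pred)
    except (TypeError, ValueError):
        # Missing tags (None) or non-numeric text earn no reward.
        return 0.0
    if gold == 0:
        return 1.0 if value == 0 else 0.0
    return 1.0 if abs(value - gold) / abs(gold) <= tolerance else 0.0
```

Both functions are deterministic and cheap to evaluate, which is the property that makes this kind of reward attractive for reinforcement learning: every sampled response can be scored by a rule rather than by a separate reward model.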

Section 07

Application Prospects: Research Contributions and Practical Value

  • Research Contributions: Verifies the effectiveness of progressive training, shows that a small model can outperform large models in a specific domain, and enriches the open-source ecosystem;
  • Practical Applications: Improves spatial understanding in scenarios such as robot navigation, autonomous driving, augmented reality, and intelligent surveillance.

Section 08

Summary and Outlook: A Milestone in Spatial Reasoning Training

SpatialLadder is an important milestone in building spatial reasoning capabilities for VLMs. Its results suggest that, for spatial reasoning, a well-designed training strategy can matter more than scaling up the model. The framework offers a reference for training other complex AI capabilities, and with its acceptance by ICLR 2026 we hope it will inspire further innovation in training paradigms.