SpatialLadder: A Three-Stage Progressive Training Framework for Spatial Reasoning in Vision-Language Models

The SpatialLadder framework proposed by the REAL Lab at Zhejiang University enables a 3B-parameter vision-language model (VLM) to outperform GPT-4o and Gemini-2.0-Flash on spatial reasoning tasks through a three-stage progressive training strategy. The paper has been accepted by ICLR 2026.

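Tags: vision-language models, spatial reasoning, progressive training, multimodal learning, reinforcement learning, ICLR 2026, Zhejiang University, open-source models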

Published 2026-06-09 15:34 | Recent activity 2026-06-09 15:49 | Estimated read 6 min

Section 01

[Introduction] SpatialLadder: A Spatial Reasoning Training Framework for Small Models to Outperform Large Models

The REAL Lab at Zhejiang University proposes SpatialLadder, a three-stage progressive training framework. By training in the order perception → understanding → reasoning, the framework enables a 3B-parameter vision-language model (VLM) to outperform GPT-4o and Gemini-2.0-Flash on spatial reasoning tasks. The paper has been accepted by ICLR 2026. The project has open-sourced its code, paper, pre-trained model, the dedicated SpatialLadder-26k dataset, and the SPBench benchmark.

Section 02

Research Background: Bottlenecks in Spatial Reasoning of VLMs and Defects of Existing Methods

Vision-language models have made significant progress on tasks such as image understanding and question answering, but their spatial reasoning (for example, judging the relative positions of objects, integrating multiple views, or tracking trajectories in video) remains weak. Existing methods train complex spatial reasoning directly and skip the perceptual foundation it depends on, so the resulting capability rests on an unstable base.

Section 03

SpatialLadder Framework: Three-Stage Progressive Training Strategy

The framework follows the principle of progressive learning in cognitive science and is divided into three stages:

Spatial Perception Stage: Establish object-position mapping through object detection/localization tasks to solidify the foundation;
Spatial Understanding Stage: Train single-image/multi-view/video spatial reasoning capabilities using the SpatialLadder-26k dataset;
Complex Reasoning Stage: Introduce reinforcement learning with verifiable rewards to enhance multi-step reasoning and spatial imagination abilities.
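The staged schedule above can be pictured as an ordered curriculum in which each stage starts from the result of the previous one. The following is a minimal sketch, not the authors' training code: the `Stage` class, the `objective` labels (in particular treating the first two stages as supervised), and the `log_only_trainer` stand-in are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    objective: str          # assumed label, not taken from the paper
    focus: Tuple[str, ...]  # what the stage trains, paraphrased from the list above


# Stage order follows the article; the objective labels for stages 1 and 2 are assumptions.
CURRICULUM: Tuple[Stage, ...] = (
    Stage("spatial_perception", "supervised", ("object_detection", "object_localization")),
    Stage("spatial_understanding", "supervised", ("single_image", "multi_view", "video")),
    Stage("complex_reasoning", "rl_verifiable_reward", ("multi_step_reasoning",)),
)


def run_curriculum(state, train_stage: Callable):
    """Apply train_stage once per stage, feeding each stage the previous result."""
    for stage in CURRICULUM:
        state = train_stage(state, stage)
    return state


def log_only_trainer(history: List[str], stage: Stage) -> List[str]:
    """Hypothetical stand-in for a real training loop: it only records the stage."""
    return history + [f"{stage.name}:{stage.objective}"]


history = run_curriculum([], log_only_trainer)
print(history)
```

In a real pipeline, `state` would be a model checkpoint and `train_stage` would run the corresponding training loop; the point of the sketch is only that later stages never start from the untrained base.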

Section 04

Dataset Support: Features of SpatialLadder-26k

The SpatialLadder-26k dataset built by the research team contains 26,610 annotated samples, covering four major task categories: object localization, single-image/multi-view/video reasoning. The annotations are consistent and accurate, covering various scenarios, and have been open-sourced on Hugging Face.
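To make the four task categories concrete, here is a small sketch of how such samples might be represented and tallied. The category names come from the description above; the `SpatialSample` fields, the file names, and the toy questions are hypothetical and do not reflect the released dataset's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Tuple


class TaskCategory(Enum):
    # The four categories named in the article.
    OBJECT_LOCALIZATION = "object_localization"
    SINGLE_IMAGE = "single_image"
    MULTI_VIEW = "multi_view"
    VIDEO = "video"


@dataclass(frozen=True)
class SpatialSample:
    # Hypothetical record layout, for illustration only.
    category: TaskCategory
    media_paths: Tuple[str, ...]
    question: str
    answer: str


def count_by_category(samples: Iterable[SpatialSample]) -> Dict[TaskCategory, int]:
    """Return a count for every category, including categories with zero samples."""
    counts = Counter(sample.category for sample in samples)
    return {category: counts.get(category, 0) for category in TaskCategory}


# Toy records invented for this example; they are not drawn from SpatialLadder-26k.
toy_samples = [
    SpatialSample(TaskCategory.OBJECT_LOCALIZATION, ("img_0.jpg",), "Where is the chair?", "[10, 20, 110, 220]"),
    SpatialSample(TaskCategory.MULTI_VIEW, ("view_a.jpg", "view_b.jpg"), "Which object is closer to the door?", "B"),
    SpatialSample(TaskCategory.MULTI_VIEW, ("view_c.jpg", "view_d.jpg"), "Is the lamp left of the sofa?", "yes"),
]
counts = count_by_category(toy_samples)
print({category.value: n for category, n in counts.items()})
```

A tally like this is a cheap way to check that every category is represented before training on a mixed-task dataset.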

Section 05

Experimental Results: 3B Model Outperforms Commercial Large Models

SpatialLadder-3B performs strongly on spatial reasoning benchmarks:

An average improvement of 23.4% over the base model;
Outperforms GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%;
A 7.2% improvement in generalization ability on out-of-domain benchmarks.

Section 06

Technical Highlights: Three Key Innovations

Progressive Training Paradigm: Moves beyond single-pass end-to-end training and builds spatial intelligence layer by layer;
Reinforcement Learning with Verifiable Rewards: Exploits the fact that answers to spatial reasoning questions can be checked automatically, which improves training efficiency and stability;
High-Quality Dedicated Dataset: A standardized construction process keeps the data systematic and consistent.
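The idea behind a verifiable reward is that the score comes from an automatic check against a known answer rather than from a learned reward model. The sketch below illustrates the general technique only; the `<answer>` tag format, the 10% relative tolerance for numeric answers, and the function names are assumptions, not the reward design used in the paper.

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[str]:
    """Pull the final answer out of assumed <answer>...</answer> tags."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def verifiable_reward(response: str, ground_truth: str, rel_tol: float = 0.1) -> float:
    """Return 1.0 if the extracted answer checks out against ground_truth, else 0.0."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0  # no parseable answer, so nothing can be verified
    try:
        predicted_value = float(predicted)
        true_value = float(ground_truth)
    except ValueError:
        # Non-numeric answers (e.g. multiple-choice letters): case-insensitive exact match.
        return 1.0 if predicted.lower() == ground_truth.strip().lower() else 0.0
    if true_value == 0:
        return 1.0 if predicted_value == 0 else 0.0
    # Numeric answers: accept values within the assumed relative tolerance.
    return 1.0 if abs(predicted_value - true_value) <= rel_tol * abs(true_value) else 0.0


print(verifiable_reward("The sofa is nearer. <answer>B</answer>", "b"))
print(verifiable_reward("<answer>2.05</answer>", "2.0"))
print(verifiable_reward("<answer>3.0</answer>", "2.0"))
```

Because the check is deterministic, the same response always receives the same reward, which is one plausible reason such rewards help training stability.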

Section 07

Application Prospects: Research Contributions and Practical Value

Research Contributions: Verifies the effectiveness of progressive training, shows that a small model can outperform much larger ones in a specific domain, and enriches the open-source ecosystem;
Practical Applications: Improves spatial understanding in scenarios such as robot navigation, autonomous driving, augmented reality, and intelligent surveillance.

Section 08

Summary and Outlook: A Milestone in Spatial Reasoning Training

SpatialLadder is an important milestone in training spatial reasoning for VLMs. Its results suggest that, for this capability, optimizing the training strategy can matter more than scaling up the model. The framework offers a reference for developing other complex AI capabilities, and its acceptance by ICLR 2026 may inspire further innovation in training paradigms.
