Zing Forum

SOLE-R1: Using Video Language Reasoning as the Sole Reward Signal for Robot Reinforcement Learning

This article introduces SOLE-R1, a video language reasoning model specifically designed for robot reinforcement learning. Through spatiotemporal chain-of-thought reasoning, the model generates dense task progress estimates as reward signals, enabling robots to learn 24 unseen manipulation tasks from scratch without real rewards, demonstrations, or task-specific tuning.

Tags: SOLE-R1, robot reinforcement learning, vision-language model, video reasoning, reward signal, spatiotemporal chain-of-thought, reward hacking, zero-shot learning, embodied intelligence, manipulation tasks
Published 2026-03-31 01:46 · Recent activity 2026-03-31 12:20 · Estimated read 6 min

Section 01

SOLE-R1: Using Video Language Reasoning as the Sole Reward Signal for Robot RL (Introduction)

This post introduces SOLE-R1, a video language reasoning model designed specifically for robot reinforcement learning (RL). It generates dense task progress estimates via spatiotemporal chain-of-thought reasoning to serve as the sole reward signal. Notably, SOLE-R1 enables robots to learn 24 unseen manipulation tasks from scratch without real rewards, demonstrations, or task-specific tuning, addressing the reward hacking problem common with general visual language models (VLMs) in RL applications.


Section 02

Research Background & Challenges

VLMs have shown strong capabilities in image understanding and visual QA, inspiring their use to supervise robot learning. However, when even top VLMs are used as RL evaluators, they often fail under partial observability and distribution shift, leading to reward hacking: policies exploit the evaluator's perceptual errors to collect spuriously high rewards instead of actually solving the task. This failure mode is a core barrier to applying VLMs in robot RL.


Section 03

Core Innovations of SOLE-R1

SOLE-R1 (Self-Observing LEarner) is tailored for online RL as the sole reward source, with key features:

  1. Spatiotemporal Chain-of-Thought Reasoning: At each time step, it tracks object positions, action progress, and task stages to generate dense task progress estimates.
  2. Large-Scale Video Trajectory Synthesis Pipeline: Automatically generates time-anchored chain-of-thought trajectories aligned with continuous progress signals for training.
  3. Hybrid Training Framework: Combines supervised fine-tuning (SFT) with reinforcement learning from verifiable rewards (RLVR) to first instill basic reasoning and then optimize reward robustness.
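The post does not give SOLE-R1's actual interfaces, but the core progress-to-reward idea can be sketched. Below is a minimal, runnable stand-in: `ProgressRewarder`, its frame window, and the delta-based reward are all illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ProgressRewarder:
    """Turns dense task-progress estimates (in [0, 1]) into per-step rewards.

    The real estimator would reason over a short frame window plus the
    natural-language task description; here it is a stub so the loop runs.
    """
    task: str
    window: int = 8
    frames: list = field(default_factory=list)
    last_progress: float = 0.0

    def estimate_progress(self, clip) -> float:
        # Stub: a real reasoner would run spatiotemporal chain-of-thought
        # over `clip`; here progress simply grows with total frames seen.
        return min(1.0, len(self.frames) / 100.0)

    def reward(self, frame) -> float:
        self.frames.append(frame)
        clip = self.frames[-self.window:]
        progress = self.estimate_progress(clip)
        # Reward the *change* in estimated progress, so loitering at a
        # plateau earns nothing and regressions are penalized.
        r = progress - self.last_progress
        self.last_progress = progress
        return r

rewarder = ProgressRewarder(task="stack the red block on the blue block")
rewards = [rewarder.reward(frame=t) for t in range(10)]
total = sum(rewards)  # telescopes to the final progress estimate
```

One design note: rewarding the progress delta (rather than raw progress) keeps the per-episode return bounded by the final progress estimate, which limits how much a policy can gain by milking a static state.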

Section 04

Experimental Validation Results

SOLE-R1 was evaluated in four simulation environments and on real robots:

  • Zero-shot Online Learning: Robots start from random policies, learning without real rewards, success indicators, demos, or task-specific tuning.
  • 24 Unseen Tasks: Mastered 24 manipulation tasks (grasping, placing, stacking, pushing/pulling) that were not seen during training.
  • Superior to Top VLMs: Outperforms GPT-5 and Gemini-3-Pro in task success rate and resists reward hacking more strongly, distinguishing genuine progress from superficially successful-looking behavior.
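To make the zero-shot setup concrete, here is a toy sketch of online learning driven purely by a progress estimate: a one-parameter policy on a 1-D reach task improves by hill climbing against a stub reward model, with no environment reward, success flag, or demonstration. Everything here (the task, `progress_estimate`, the optimizer) is an illustrative assumption, not SOLE-R1's actual training procedure.

```python
import random

# Toy 1-D "reach the goal" task. The ONLY learning signal is a progress
# estimate in [0, 1] standing in for the video-language reward model.
GOAL, STEPS = 10.0, 10

def progress_estimate(pos: float) -> float:
    """Stub reward model: linear closeness of `pos` to GOAL."""
    return 1.0 - abs(GOAL - pos) / (2 * GOAL)

def run_episode(p_right: float) -> float:
    """Roll out a one-parameter policy; return final progress only."""
    pos = 0.0
    for _ in range(STEPS):
        pos += 1.0 if random.random() < p_right else -1.0
    return progress_estimate(pos)

def avg_return(p: float, episodes: int = 20) -> float:
    return sum(run_episode(p) for _ in range(episodes)) / episodes

random.seed(0)
p_right = 0.5  # random initial policy: step right or left with equal odds
for _ in range(200):  # simple hill climbing on the VLM-style reward alone
    cand = min(1.0, max(0.0, p_right + random.uniform(-0.1, 0.1)))
    if avg_return(cand) >= avg_return(p_right):
        p_right = cand
# p_right drifts toward 1.0 (always step toward the goal)
```

The point of the sketch is the interface, not the optimizer: the learner never sees a true reward or success bit, only the (noisy) progress estimates, yet the policy still improves.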

Section 05

Technical Significance & Industry Impact

SOLE-R1's contributions:

  1. Freedom from Hand-Designed Rewards: Replaces manually engineered reward functions (which require domain expertise and tuning) with natural-language task descriptions.
  2. Resistance to Reward Hacking: Specialized training and spatiotemporal reasoning identify genuine progress, avoiding deception by superficial visual similarity.
  3. A Step Toward General Robot Intelligence: Serves as a unified interface for evaluating diverse tasks without per-task reward design.
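The "unified interface" point can be illustrated: retargeting the reward model to a new task means swapping the language description, with no per-task reward code. The `toy_reasoner` below is a deliberately crude, hypothetical stand-in for the video reasoner; its string matching is not how SOLE-R1 works.

```python
def make_reward_fn(reasoner, task_description: str):
    """Build a reward function from a task string alone.

    `reasoner(frames, task)` is assumed to return a progress estimate
    in [0, 1]; no task-specific reward engineering is involved.
    """
    def reward_fn(frames) -> float:
        return reasoner(frames, task_description)
    return reward_fn

def toy_reasoner(frames, task):
    # Crude stand-in: "progress" is the fraction of frames whose (mock)
    # text annotation mentions the task's final word.
    target = task.split()[-1]
    hits = sum(1 for f in frames if target in f)
    return hits / max(1, len(frames))

# The same model, two tasks, zero per-task reward code:
stack = make_reward_fn(toy_reasoner, "stack the red cube")
push = make_reward_fn(toy_reasoner, "push the mug")

frames = ["cube on table", "gripper near cube", "cube on cube"]
r_stack = stack(frames)
r_push = push(frames)
```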

Section 06

Limitations & Future Directions

SOLE-R1 has room for improvement:

  • Computational Overhead: Spatiotemporal reasoning on multi-frame videos is costlier than single-frame VLMs; future work may use model compression or efficient inference.
  • Long-Horizon Tasks: Progress estimates for tasks spanning hundreds of steps are less accurate; combining with hierarchical RL is one possible remedy.
  • Real-World Generalization: Needs further research on broader scenarios (different lighting, object categories).

Section 07

Conclusion

SOLE-R1 is a key step forward in robot learning. By specializing video language reasoning to serve as RL's sole reward signal, it addresses the failure modes of general-purpose VLMs in this role and opens a path to more general, autonomous robot learning. As embodied intelligence advances, systems that bridge high-level semantic understanding and low-level control in this way will be critical to building truly capable robot assistants.