Zing Forum

LaRA-VLA: The Implicit Reasoning Revolution in Robot Intelligence

A team including researchers from Peking University has proposed LaRA-VLA, a vision-language-action (VLA) model built on implicit reasoning. Instead of generating an explicit chain of thought, it iterates over internal hidden states, which makes robot decision-making and action prediction more efficient.

Tags: VLA, Robotics, Implicit Reasoning, Embodied Intelligence, Vision-Language Models, Peking University, AI, Machine Learning
Published 2026-04-08 00:43 | Recent activity 2026-04-08 00:51 | Estimated read 5 min

Section 01

LaRA-VLA: The Implicit Reasoning Revolution in Robot Intelligence (Introduction)

LaRA-VLA is a Vision-Language-Action (VLA) model, proposed by a team including researchers from Peking University, that reasons implicitly. Rather than writing out an explicit chain of thought, it iterates over internal hidden states, which addresses the trade-off between reasoning depth and speed found in traditional VLA models. It reports a 97.9% average success rate on the LIBERO benchmark and offers a new paradigm for real-time robot control.

Section 02

Background: The Trade-off Dilemma in Robot Decision-Making

In embodied intelligence, VLA models are a core technology for robot control, but they face a trade-off. End-to-end models respond quickly but lack deep reasoning. Explicit Chain-of-Thought (CoT) methods can handle complex reasoning, but they generate large amounts of text, and the resulting latency makes real-time control requirements hard to meet. For example, in the task of "putting a spoon into a bowl", explicit CoT needs hundreds of tokens of explanation, while robot control demands millisecond-level responses.
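The latency argument can be made concrete with a rough back-of-the-envelope sketch. The token count and per-token decoding time below are hypothetical values chosen only for illustration; they are not measurements from the paper, and the function names are made up for this example.

```python
def cot_decode_latency_ms(num_tokens, ms_per_token):
    """Latency of generating a chain of thought one token at a time."""
    return num_tokens * ms_per_token


def max_control_rate_hz(latency_ms):
    """Highest control frequency if every action waits for one full pass."""
    return 1000.0 / latency_ms


# Hypothetical numbers (assumptions, not reported results):
# 200 reasoning tokens at 20 ms per decoded token.
latency = cot_decode_latency_ms(num_tokens=200, ms_per_token=20)
print(latency)                       # 4000
print(max_control_rate_hz(latency))  # 0.25
```

Under these assumed values a single decision takes seconds, several orders of magnitude away from a millisecond-level control loop, which is the gap the article describes.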

Section 03

Core Innovations and Technical Architecture of LaRA-VLA

LaRA-VLA reasons implicitly in latent space: it improves efficiency by iteratively updating hidden states instead of generating visible text. Its core mechanism is the "latent reasoning slot", which encodes visual and language information into continuous latent vectors, refines them over several iterations, and then outputs actions. The stated advantages are computational efficiency (matrix operations in latent space replace text generation), information density (the representation is not limited by what language can express), and end-to-end trainability (the whole pipeline is optimized via backpropagation). Training follows a two-stage strategy: basic VLA pre-training first, then reinforced latent-reasoning training.
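The idea of iterating a latent slot rather than emitting tokens can be sketched in a few lines. This is a toy illustration, not the LaRA-VLA architecture: the update rule, the single slot, the random weights standing in for learned parameters, and all function names are assumptions made for this example.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def refine_slot(slot, context, w_slot, w_ctx):
    """One latent step: update the slot from itself and the fused context."""
    from_slot = matvec(w_slot, slot)
    from_ctx = matvec(w_ctx, context)
    return [math.tanh(a + b) for a, b in zip(from_slot, from_ctx)]


def predict_action(context, num_steps=3, action_dim=7, seed=0):
    """Iterate a single latent slot, then decode it into an action vector."""
    rng = random.Random(seed)
    dim = len(context)
    w_slot = random_matrix(dim, dim, rng)
    w_ctx = random_matrix(dim, dim, rng)
    w_act = random_matrix(action_dim, dim, rng)
    slot = [0.0] * dim          # the reasoning slot starts empty
    for _ in range(num_steps):  # no text tokens are generated in this loop
        slot = refine_slot(slot, context, w_slot, w_ctx)
    return matvec(w_act, slot)


# Stand-in for fused vision + language features (hypothetical values).
context = [0.1 * i for i in range(8)]
action = predict_action(context)
print(len(action))  # 7
```

Each reasoning step here is a pair of matrix-vector products, so the cost of "thinking" scales with the number of iterations rather than with the length of a generated explanation. Because every operation is differentiable, a real implementation could train the slot update and the action head jointly.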

Section 04

Performance Evidence: Benchmark Tests and Real Task Performance

On the LIBERO benchmark, LaRA-VLA achieved an average success rate of 97.9%. This is higher than the non-CoT baselines (OpenVLA: 76.5%, π₀: 94.2%) and slightly higher than the explicit CoT method DeepThinkVLA (97.0%), while also running faster than explicit CoT. On the real-world Bridge task, the success rate for "spoon placement" was 95.8%, reported as far above the other methods compared.
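To make the LIBERO comparison easier to read, the snippet below computes LaRA-VLA's margin over each baseline in percentage points. The success rates are the ones quoted above; the dictionary layout and the helper name are choices made for this example.

```python
# LIBERO average success rates (%) as quoted in the article.
libero_success = {
    "OpenVLA": 76.5,
    "π₀": 94.2,
    "DeepThinkVLA": 97.0,
    "LaRA-VLA": 97.9,
}


def margins_over_baselines(scores, model):
    """Percentage-point gap between `model` and every other entry."""
    reference = scores[model]
    return {
        name: round(reference - score, 1)
        for name, score in scores.items()
        if name != model
    }


for name, gap in margins_over_baselines(libero_success, "LaRA-VLA").items():
    print(f"{name}: +{gap} points")
# OpenVLA: +21.4 points
# π₀: +3.7 points
# DeepThinkVLA: +0.9 points
```

The gap to the explicit CoT baseline is under one point, so the main claimed benefit over that family is speed rather than accuracy.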

Section 05

Practical Application Significance and Value

LaRA-VLA offers a new paradigm for real-time robot control that eases the tension between reasoning depth and speed, and it could be extended to multi-step planning, tool use, and other human-robot collaboration tasks. For developers, the lower inference cost suggests that capable models could be deployed on ordinary hardware. For researchers, it opens up implicit reasoning as a research direction.

Section 06

Open Source Status and Future Research Directions

The research team has open-sourced the training and evaluation code (based on StarVLA), but the pre-trained model weights and datasets have not yet been released. Future directions include adding modalities such as touch and audio, improving the design of the reasoning slots, and applying the approach to other sequential decision-making tasks.

Section 07

Conclusion: Implicit Reasoning Leads a New Direction in Robot Intelligence

LaRA-VLA suggests that robots do not have to choose between reasoning depth and speed: by reasoning in latent space, a model can get both. The work may mark a turning point in robot intelligence research and could help advance practical intelligent robot assistants.