Zing Forum

DeepThinkVLA: An Innovative Framework for Endowing Vision-Language-Action Models with Explicit Reasoning Capabilities

DeepThinkVLA significantly enhances the reasoning ability of VLA models through a hybrid attention decoder and explicit Chain-of-Thought (CoT) mechanism, achieving an average success rate of 97% on the LIBERO benchmark.

VLA · Embodied Intelligence · Chain-of-Thought · Robotics · Reinforcement Learning · Vision-Language Models · LIBERO
Published 2026-04-16 18:43 · Recent activity 2026-04-16 18:51 · Estimated read 7 min

Section 01

[Introduction] DeepThinkVLA: An Innovative Framework for Endowing VLA Models with Explicit Reasoning Capabilities

Developed by the OpenBMB team, DeepThinkVLA addresses the lack of explicit reasoning in existing Vision-Language-Action (VLA) models with a hybrid attention decoder and an explicit Chain-of-Thought (CoT) mechanism, significantly improving decision quality and task success rates. The framework achieves an average success rate of 97% on the LIBERO benchmark, providing an interpretable and robust solution for embodied intelligence.

Section 02

Research Background and Motivation

VLA models are a key direction in robot control, capable of generating action sequences based on visual observations and natural language instructions. However, most existing VLA models use end-to-end reactive architectures and lack explicit reasoning, leading to poor performance in complex tasks or unexpected situations. DeepThinkVLA draws on the CoT prompting technique from large language models and innovatively applies it to the field of embodied intelligence, allowing robots to "think" before executing actions to improve decision quality.

Section 03

Core Innovations: Hybrid Attention Decoder and Latency Optimization

The core of DeepThinkVLA is its hybrid attention decoder: the 2.9-billion-parameter decoder works in two stages, first generating a complete Chain-of-Thought autoregressively under causal attention, then switching to bidirectional attention to decode the action block in parallel, resolving the conflict between sequential text reasoning and parallel action generation. To address the latency that reasoning adds, the authors propose a Masked-CoT strategy, which masks reasoning tokens while retaining action-related information, maintaining a 96.5% success rate while cutting inference latency to just 0.175 times that of the baseline.
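
The two-stage attention scheme can be pictured as a single mask over the token sequence: causal within the CoT prefix, fully bidirectional within the action block. This is a toy illustration of that pattern, not the project's actual implementation; sizes and names are made up.

```python
import numpy as np

def hybrid_attention_mask(n_cot: int, n_action: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for n_cot chain-of-thought
    tokens followed by n_action action tokens.

    CoT tokens attend causally (each sees only itself and earlier tokens);
    action tokens attend bidirectionally to the full CoT prefix and to each
    other, so the whole action block can be decoded in parallel.
    """
    n = n_cot + n_action
    mask = np.zeros((n, n), dtype=bool)
    # Causal (lower-triangular) attention within the CoT prefix.
    for i in range(n_cot):
        mask[i, : i + 1] = True
    # Action tokens see everything: all CoT tokens and all action tokens.
    mask[n_cot:, :] = True
    return mask

m = hybrid_attention_mask(3, 2)
```

In a real decoder this mask would be passed to the attention layers in place of the usual purely causal mask once the CoT has been generated.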

Section 04

Data Engine and Training Pipeline

Data Engine: A two-stage CoT annotation pipeline. Stage 1: key-frame extraction, annotation generation by a cloud-based large vision-language model (LVLM), and manual review. Stage 2: a local VLM is fine-tuned on the high-quality reviewed samples and automatically annotates the remaining frames, ensuring trajectory coherence. The resulting LIBERO CoT dataset has been open-sourced.
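
The data flow of the two stages can be sketched as follows. Every callable here is a hypothetical stub standing in for components the post describes (key-frame extractor, cloud LVLM, human review, local VLM fine-tuning); none of these names come from the repository.

```python
def build_cot_dataset(trajectories, extract_keyframes, cloud_annotate,
                      human_review, finetune_local_vlm, n_seed=2):
    """Two-stage CoT annotation sketch (all callables are hypothetical stubs).

    Stage 1: key frames from a small seed subset are annotated by a cloud
    LVLM and manually reviewed. Stage 2: a local VLM fine-tuned on the
    reviewed seed annotates the remaining trajectories automatically.
    """
    seed, rest = trajectories[:n_seed], trajectories[n_seed:]
    # Stage 1: expensive but high-quality annotation of the seed set.
    reviewed = [human_review(cloud_annotate(extract_keyframes(t))) for t in seed]
    # Stage 2: fine-tune a cheap local annotator on the reviewed samples,
    # then run it over everything that remains.
    local_vlm = finetune_local_vlm(reviewed)
    return reviewed + [local_vlm(t) for t in rest]

# Toy stubs to show the data flow.
data = [["f1", "f2"], ["f3"], ["f4", "f5"]]
out = build_cot_dataset(
    data,
    extract_keyframes=lambda t: t[:1],
    cloud_annotate=lambda ks: [f"cot({k})" for k in ks],
    human_review=lambda anns: anns,
    finetune_local_vlm=lambda samples: (lambda t: [f"auto({f})" for f in t]),
)
```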

Training Pipeline: Two-stage training. Stage 1: supervised fine-tuning (SFT) with a cross-entropy loss teaches reasoning-action coordination. Stage 2: reinforcement learning based on Group Relative Policy Optimization (GRPO) improves long-horizon performance (LIBERO-Long success rate rises from 94.2% to 96.2%) through sparse reward normalization and KL regularization.
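
At the heart of GRPO is a group-relative advantage: sparse rewards from a group of rollouts of the same task are normalized by the group's mean and standard deviation, so no learned value function is needed. A minimal sketch of that normalization, with the KL regularizer and policy-ratio clipping omitted:

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages for one group of rollouts of the same task:
    each rollout's sparse reward is standardized against the group mean and
    standard deviation (the epsilon guards against a zero-variance group)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four rollouts with sparse 0/1 success rewards: successes are pushed up,
# failures pushed down, relative to the group.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

These advantages then weight the policy-gradient update, with a KL term keeping the fine-tuned policy close to the SFT policy.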

Section 05

Performance Evaluation and Experimental Results

LIBERO Benchmark: Average success rate of 97% (99% for Object class, 96.6% for Spatial class, 96.4% for Goal class, 96.2% for Long class), outperforming baselines like autoregressive and diffusion models.
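
A quick arithmetic check confirms the headline figure is simply the unweighted mean of the four suite scores:

```python
# Per-suite LIBERO success rates reported above.
rates = {"Object": 99.0, "Spatial": 96.6, "Goal": 96.4, "Long": 96.2}

# The headline figure is their unweighted mean.
average = sum(rates.values()) / len(rates)
print(round(average, 2))  # → 97.05, consistent with the reported ~97%
```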

Architecture Comparison: The hybrid decoder improves performance by 15.5% compared to the autoregressive CoT variant; random CoT reduces performance to 85.1%, demonstrating the importance of reasoning quality.

Zero-Shot Transfer: Zero-shot testing on LIBERO Plus (with perturbations in object layout, instructions, etc.) achieves an overall success rate of 79%, showing good robustness.

Section 06

Qualitative Analysis and Research Significance

Self-Correction Capability: The explicit reasoning mechanism allows the model to identify execution errors (e.g., object dropping) and guide recovery actions via the Chain-of-Thought, while reactive baselines tend to stagnate.

Research Significance: Moving from end-to-end black-box mapping to interpretable, debuggable explicit reasoning improves the safety and controllability of robot systems. Deeper integration of reinforcement learning with VLA models is expected to further advance the deployment of intelligent robots.

Section 07

Open-Source Resources and Usage Guide

Open-Source Resources: Model weights (base/SFT/RL versions), LIBERO CoT dataset, training and evaluation scripts, DeepSpeed configurations, etc.

Environment Requirements: Linux/WSL with an NVIDIA GPU (CUDA 12.x), Python ≥ 3.10; SFT requires eight 80 GB GPUs.
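
The version constraints above can be verified before launching a run. This is a generic sanity-check sketch, not a script from the repository:

```python
import shutil
import sys

def check_env(version_info, nvidia_smi_path):
    """Return a list of environment problems (empty list = looks OK).
    Mirrors the stated requirements: Python >= 3.10 plus a visible NVIDIA
    driver stack (CUDA 12.x expected)."""
    problems = []
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python >= 3.10 required")
    if not nvidia_smi_path:
        problems.append("nvidia-smi not found (NVIDIA driver + CUDA 12.x expected)")
    return problems

# On a real machine, probe the interpreter and PATH directly:
issues = check_env(sys.version_info, shutil.which("nvidia-smi"))
```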

Usage Tips: Enabling Masked-CoT during evaluation reduces latency. The project builds on Hugging Face components, and related open-source projects are acknowledged.