# DeepThinkVLA: A Framework for Endowing Vision-Language-Action Models with Explicit Reasoning Capabilities

> DeepThinkVLA adds explicit Chain-of-Thought (CoT) reasoning to VLA models through a hybrid attention decoder, reaching an average success rate of 97% on the LIBERO benchmark.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-16T10:43:05.000Z
- Last activity: 2026-04-16T10:51:06.513Z
- Heat: 148.9
- Keywords: VLA, embodied intelligence, chain-of-thought, robotics, reinforcement learning, vision-language models, LIBERO
- Page link: https://www.zingnex.cn/en/forum/thread/deepthinkvla
- Canonical: https://www.zingnex.cn/forum/thread/deepthinkvla

---

## Introduction

Developed by the OpenBMB team, DeepThinkVLA addresses the lack of explicit reasoning in existing Vision-Language-Action (VLA) models by combining a hybrid attention decoder with an explicit Chain-of-Thought (CoT) mechanism, which improves decision quality and task success rates. The framework reaches an average success rate of 97% on the LIBERO benchmark and offers a more interpretable and robust approach to embodied intelligence.

## Research Background and Motivation

VLA models are a key direction in robot control: they generate action sequences from visual observations and natural language instructions. However, most existing VLA models use end-to-end reactive architectures without explicit reasoning, which hurts performance on complex tasks and in unexpected situations. DeepThinkVLA borrows the CoT prompting technique from large language models and applies it to embodied intelligence, letting the robot "think" before it acts in order to improve decision quality.

## Core Innovations: Hybrid Attention Decoder and Latency Optimization

The core of DeepThinkVLA is its hybrid attention decoder. The 2.9-billion-parameter decoder operates in two stages: it first generates a complete Chain-of-Thought autoregressively, then switches to bidirectional attention to output an action chunk in parallel, resolving the modality conflict between reasoning text and actions. To address reasoning latency, the authors propose the Masked-CoT strategy, which masks reasoning tokens while retaining action-related information. This maintains a 96.5% success rate while cutting latency to 0.175 times that of the baseline.
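The two attention regimes can be illustrated with a toy attention mask. This is a minimal sketch, not the project's implementation: the function name `hybrid_attention_mask`, the token layout (prefix, then reasoning, then actions), and the choice to treat the prefix causally are all assumptions made for illustration.

```python
def hybrid_attention_mask(
    num_prefix: int, num_reasoning: int, num_action: int
) -> list[list[bool]]:
    """Build a mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical, for this sketch only):
    [prefix tokens | reasoning (CoT) tokens | action tokens]
    """
    total = num_prefix + num_reasoning + num_action
    action_start = num_prefix + num_reasoning
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < action_start:
                # Prefix and reasoning tokens are causal: each position
                # sees itself and everything before it.
                mask[q][k] = k <= q
            else:
                # Action tokens are bidirectional within the chunk and also
                # see the full context, so every key position is visible.
                mask[q][k] = True
    return mask
```

For `hybrid_attention_mask(2, 3, 2)`, row 4 (the last reasoning token) cannot see position 5, while rows 5 and 6 (the action tokens) see every position, including each other. That mutual visibility is what allows the whole action chunk to be produced in one parallel pass.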

## Data Engine and Training Pipeline

**Data Engine**: A two-stage CoT annotation pipeline:

1. Extract key frames, generate annotations with a cloud-based Large Vision-Language Model (LVLM), and review them manually.
2. Fine-tune a local VLM on the resulting high-quality samples and use it to annotate the remaining frames automatically, which keeps the annotations coherent along each trajectory.

The resulting LIBERO CoT dataset has been open-sourced.
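The routing between the two annotation stages can be sketched as follows. The source does not say how key frames are chosen, so the heuristic here (the first frame plus every frame where the gripper state changes) is purely an assumption, as is the helper name `split_for_annotation`.

```python
def split_for_annotation(gripper_open: list[bool]) -> tuple[list[int], list[int]]:
    """Return (key_frames, remaining_frames) as lists of frame indices.

    Assumed heuristic (not from the source): frame 0 and every frame where
    the gripper state differs from the previous frame count as key frames.
    """
    key_frames = [0] if gripper_open else []
    for i in range(1, len(gripper_open)):
        if gripper_open[i] != gripper_open[i - 1]:
            key_frames.append(i)
    key_set = set(key_frames)
    # Everything that is not a key frame is left for the local VLM (stage 2).
    remaining = [i for i in range(len(gripper_open)) if i not in key_set]
    return key_frames, remaining


# Stage 1 would send key_frames to the cloud LVLM and manual review;
# stage 2 would let the fine-tuned local VLM annotate remaining_frames.
key_frames, remaining_frames = split_for_annotation([True, True, False, False, True])
```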

**Training Pipeline**: Training also proceeds in two stages:

1. Supervised Fine-Tuning (SFT) with a cross-entropy loss, which teaches the model to coordinate reasoning and action.
2. Reinforcement learning based on Group Relative Policy Optimization (GRPO), using sparse reward normalization and KL regularization. This improves long-horizon performance: the LIBERO-Long success rate rises from 94.2% to 96.2%.
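The reward normalization step in GRPO can be sketched as below. This covers only the group-relative advantage computation, not the policy update or the KL term. The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration; the project's scripts may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same task.

    Each advantage is (reward - group mean) / (group std + eps), so a sparse
    0/1 success signal becomes a signed, scale-free learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task with sparse success rewards (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative. When every rollout in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.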

## Performance Evaluation and Experimental Results

**LIBERO Benchmark**: Average success rate of 97% (99% on the Object suite, 96.6% on Spatial, 96.4% on Goal, and 96.2% on Long), outperforming autoregressive and diffusion-based baselines.
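As a quick consistency check, the reported average matches the unweighted mean of the four per-suite numbers above. The variable names are illustrative only.

```python
# Per-suite success rates (%) as reported in this section.
suite_success = {"Object": 99.0, "Spatial": 96.6, "Goal": 96.4, "Long": 96.2}

# Unweighted mean across the four suites.
average = sum(suite_success.values()) / len(suite_success)
print(f"{average:.2f}")  # 97.05
```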

**Architecture Comparison**: The hybrid decoder outperforms the autoregressive CoT variant by 15.5%. Replacing the reasoning with random CoT drops the success rate to 85.1%, which shows that reasoning quality matters.

**Zero-Shot Transfer**: In zero-shot testing on LIBERO Plus, which perturbs object layout, instructions, and other factors, the model achieves an overall success rate of 79%, indicating robustness to these perturbations.

## Qualitative Analysis and Research Significance

**Self-Correction Capability**: The explicit reasoning mechanism lets the model recognize execution errors (e.g., a dropped object) and plan recovery actions through its Chain-of-Thought, whereas reactive baselines tend to stall.

**Research Significance**: Moving from an end-to-end black-box mapping to explicit reasoning that can be inspected and debugged improves the safety and controllability of robot systems. Tighter integration of reinforcement learning with VLA models may further support the deployment of intelligent robots.

## Open-Source Resources and Usage Guide

**Open-Source Resources**: Model weights (base/SFT/RL versions), LIBERO CoT dataset, training and evaluation scripts, DeepSpeed configurations, etc.

**Environment Requirements**: Linux/WSL + NVIDIA GPU (CUDA 12.x), Python ≥3.10; SFT requires 8x80GB GPUs.
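A rough pre-flight check for part of these requirements might look like the sketch below. It uses only the Python standard library, so it does not verify the CUDA version or GPU memory. The function name `check_environment` is hypothetical and not part of the project.

```python
import platform
import shutil
import sys


def check_environment() -> dict[str, bool]:
    """Report which basic requirements appear to be met on this machine."""
    return {
        # The guide asks for Python 3.10 or newer.
        "python>=3.10": sys.version_info >= (3, 10),
        # WSL also reports "Linux" as the platform system.
        "linux": platform.system() == "Linux",
        # Presence of nvidia-smi hints at an NVIDIA driver install; it does
        # not confirm CUDA 12.x or the amount of GPU memory.
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
    }


for requirement, ok in check_environment().items():
    print(f"{requirement}: {'ok' if ok else 'missing'}")
```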

**Usage Tips**: Enable Masked-CoT during evaluation to reduce latency. The project builds on components from the Hugging Face ecosystem and acknowledges the related projects it draws on.