Zing Forum

DeepThinkVLA: An Innovative Framework for Endowing Vision-Language-Action Models with Explicit Reasoning Capabilities

DeepThinkVLA significantly enhances the reasoning ability of VLA models through a hybrid attention decoder and explicit Chain-of-Thought (CoT) mechanism, achieving an average success rate of 97% on the LIBERO benchmark.

VLA · Embodied Intelligence · Chain-of-Thought · Robotics · Reinforcement Learning · Vision-Language Models · LIBERO
Published 2026-04-16 18:43 · Recent activity 2026-04-16 18:51 · Estimated read 7 min

Section 01

[Introduction] DeepThinkVLA: An Innovative Framework for Endowing VLA Models with Explicit Reasoning Capabilities

Developed by the OpenBMB team, DeepThinkVLA addresses the lack of explicit reasoning in existing Vision-Language-Action (VLA) models with a hybrid attention decoder and an explicit Chain-of-Thought (CoT) mechanism, significantly improving decision quality and task success rates. The framework achieves an average success rate of 97% on the LIBERO benchmark, providing an interpretable and robust solution for embodied intelligence.

Section 02

Research Background and Motivation

VLA models are a key direction in robot control, capable of generating action sequences based on visual observations and natural language instructions. However, most existing VLA models use end-to-end reactive architectures and lack explicit reasoning, leading to poor performance in complex tasks or unexpected situations. DeepThinkVLA draws on the CoT prompting technique from large language models and innovatively applies it to the field of embodied intelligence, allowing robots to "think" before executing actions to improve decision quality.

Section 03

Core Innovations: Hybrid Attention Decoder and Latency Optimization

The core of DeepThinkVLA is its hybrid attention decoder: the 2.9-billion-parameter decoder works in two stages, first generating a complete Chain-of-Thought autoregressively under causal attention, then switching to bidirectional attention to decode the action block in parallel, resolving the conflict between sequential text reasoning and parallel action generation. To address the latency that reasoning adds, the authors propose a Masked-CoT strategy, which masks reasoning tokens while retaining action-related information, maintaining a 96.5% success rate while cutting inference latency to just 0.175 times that of the baseline.
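
The two-stage attention scheme can be pictured as a single mask over the token sequence: causal within the CoT prefix, fully bidirectional within the action block. This is a toy illustration of that pattern, not the project's actual implementation; sizes and names are made up.

```python
import numpy as np

def hybrid_attention_mask(n_cot: int, n_action: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for n_cot chain-of-thought
    tokens followed by n_action action tokens.

    CoT tokens attend causally (each sees only itself and earlier tokens);
    action tokens attend bidirectionally to the full CoT prefix and to each
    other, so the whole action block can be decoded in parallel.
    """
    n = n_cot + n_action
    mask = np.zeros((n, n), dtype=bool)
    # Causal (lower-triangular) attention within the CoT prefix.
    for i in range(n_cot):
        mask[i, : i + 1] = True
    # Action tokens see everything: all CoT tokens and all action tokens.
    mask[n_cot:, :] = True
    return mask

m = hybrid_attention_mask(3, 2)
```

In a real decoder this mask would be passed to the attention layers in place of the usual purely causal mask once the CoT has been generated.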

Section 04

Data Engine and Training Pipeline

Data Engine: A two-stage CoT annotation pipeline. Stage 1: key-frame extraction, annotation generation by a cloud-based large vision-language model (LVLM), and manual review. Stage 2: a local VLM is fine-tuned on the high-quality reviewed samples and automatically annotates the remaining frames, ensuring trajectory coherence. The resulting LIBERO CoT dataset has been open-sourced.
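
The data flow of the two stages can be sketched as follows. Every callable here is a hypothetical stub standing in for components the post describes (key-frame extractor, cloud LVLM, human review, local VLM fine-tuning); none of these names come from the repository.

```python
def build_cot_dataset(trajectories, extract_keyframes, cloud_annotate,
                      human_review, finetune_local_vlm, n_seed=2):
    """Two-stage CoT annotation sketch (all callables are hypothetical stubs).

    Stage 1: key frames from a small seed subset are annotated by a cloud
    LVLM and manually reviewed. Stage 2: a local VLM fine-tuned on the
    reviewed seed annotates the remaining trajectories automatically.
    """
    seed, rest = trajectories[:n_seed], trajectories[n_seed:]
    # Stage 1: expensive but high-quality annotation of the seed set.
    reviewed = [human_review(cloud_annotate(extract_keyframes(t))) for t in seed]
    # Stage 2: fine-tune a cheap local annotator on the reviewed samples,
    # then run it over everything that remains.
    local_vlm = finetune_local_vlm(reviewed)
    return reviewed + [local_vlm(t) for t in rest]

# Toy stubs to show the data flow.
data = [["f1", "f2"], ["f3"], ["f4", "f5"]]
out = build_cot_dataset(
    data,
    extract_keyframes=lambda t: t[:1],
    cloud_annotate=lambda ks: [f"cot({k})" for k in ks],
    human_review=lambda anns: anns,
    finetune_local_vlm=lambda samples: (lambda t: [f"auto({f})" for f in t]),
)
```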

Training Pipeline: Two-stage training. Stage 1: supervised fine-tuning (SFT) with a cross-entropy loss teaches reasoning-action coordination. Stage 2: reinforcement learning based on Group Relative Policy Optimization (GRPO) improves long-horizon performance (LIBERO-Long success rate rises from 94.2% to 96.2%) through sparse reward normalization and KL regularization.
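
At the heart of GRPO is a group-relative advantage: sparse rewards from a group of rollouts of the same task are normalized by the group's mean and standard deviation, so no learned value function is needed. A minimal sketch of that normalization, with the KL regularizer and policy-ratio clipping omitted:

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages for one group of rollouts of the same task:
    each rollout's sparse reward is standardized against the group mean and
    standard deviation (the epsilon guards against a zero-variance group)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four rollouts with sparse 0/1 success rewards: successes are pushed up,
# failures pushed down, relative to the group.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

These advantages then weight the policy-gradient update, with a KL term keeping the fine-tuned policy close to the SFT policy.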

Section 05

Performance Evaluation and Experimental Results

LIBERO Benchmark: Average success rate of 97% (99% for Object class, 96.6% for Spatial class, 96.4% for Goal class, 96.2% for Long class), outperforming baselines like autoregressive and diffusion models.
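
A quick arithmetic check confirms the headline figure is simply the unweighted mean of the four suite scores:

```python
# Per-suite LIBERO success rates reported above.
rates = {"Object": 99.0, "Spatial": 96.6, "Goal": 96.4, "Long": 96.2}

# The headline figure is their unweighted mean.
average = sum(rates.values()) / len(rates)
print(round(average, 2))  # → 97.05, consistent with the reported ~97%
```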

Architecture Comparison: The hybrid decoder improves performance by 15.5% compared to the autoregressive CoT variant; random CoT reduces performance to 85.1%, demonstrating the importance of reasoning quality.

Zero-Shot Transfer: Zero-shot testing on LIBERO Plus (with perturbations in object layout, instructions, etc.) achieves an overall success rate of 79%, showing good robustness.

Section 06

Qualitative Analysis and Research Significance

Self-Correction Capability: The explicit reasoning mechanism allows the model to identify execution errors (e.g., object dropping) and guide recovery actions via the Chain-of-Thought, while reactive baselines tend to stagnate.

Research Significance: Moving from end-to-end black-box mapping to interpretable, debuggable explicit reasoning improves the safety and controllability of robot systems. Deeper integration of reinforcement learning with VLA models is expected to further advance the deployment of intelligent robots.

Section 07

Open-Source Resources and Usage Guide

Open-Source Resources: Model weights (base/SFT/RL versions), LIBERO CoT dataset, training and evaluation scripts, DeepSpeed configurations, etc.

Environment Requirements: Linux/WSL with an NVIDIA GPU (CUDA 12.x), Python ≥ 3.10; SFT requires eight 80 GB GPUs.
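
The version constraints above can be verified before launching a run. This is a generic sanity-check sketch, not a script from the repository:

```python
import shutil
import sys

def check_env(version_info, nvidia_smi_path):
    """Return a list of environment problems (empty list = looks OK).
    Mirrors the stated requirements: Python >= 3.10 plus a visible NVIDIA
    driver stack (CUDA 12.x expected)."""
    problems = []
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python >= 3.10 required")
    if not nvidia_smi_path:
        problems.append("nvidia-smi not found (NVIDIA driver + CUDA 12.x expected)")
    return problems

# On a real machine, probe the interpreter and PATH directly:
issues = check_env(sys.version_info, shutil.which("nvidia-smi"))
```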

Usage Tips: Enabling Masked-CoT during evaluation reduces latency. The project builds on Hugging Face components, and related open-source projects are acknowledged.