
BeTTER Benchmark: Debunking the Illusion of Embodied Reasoning Capabilities in VLA Models

BeTTER uses causal intervention and kinematic isolation to decouple high-level reasoning failures from low-level execution constraints for the first time, revealing severe cognitive deficits in semantic understanding and sequence planning in current VLA models.

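VLA models, embodied intelligence, benchmarking, causal intervention, robot reasoning, vision-language models, behavioral inertia, semantic understanding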

Published 2026-04-21 14:11 | Recent activity 2026-04-21 14:20 | Estimated read 5 min

Section 01

BeTTER Benchmark: Debunking the Illusion of Embodied Reasoning Capabilities in VLA Models [Introduction]

The BeTTER benchmark uses causal intervention and kinematic isolation to decouple high-level reasoning failures from low-level execution constraints for the first time, revealing severe cognitive deficits in semantic understanding and sequence planning in current VLA models. The sections below cover the background, methodology, and diagnostic findings in turn.

Section 02

Background: The Successes and Hidden Weaknesses of VLA Models

In recent years, Vision-Language-Action (VLA) models have achieved impressive success rates in robot manipulation benchmarks, demonstrating seemingly strong semantic understanding and sequence planning capabilities. However, teams from Peking University, Tsinghua University, and BeingBeyond question whether these successes mask deep cognitive deficits, and have launched the BeTTER benchmark to debunk the "illusion" of such capabilities.

Section 03

The Nature of the Illusion: Execution Success ≠ Correct Reasoning

Current evaluations conflate task completion with correct reasoning. A model may complete a task through behavioral inertia (repeating high-frequency actions from training) rather than semantic understanding, or it may recognize objects while misunderstanding their functional or spatial relationships. BeTTER calls this the "embodied reasoning illusion": traditional metrics score only outcomes and ignore the cognitive process behind them.

Section 04

BeTTER Methodology: Causal Intervention and Kinematic Isolation

The core innovations of BeTTER are causal intervention and kinematic isolation:

Causal intervention: Modify environmental variables (e.g., physical properties while keeping object appearance unchanged) to observe the model's sensitivity to semantically relevant interventions;
Kinematic isolation: Separate the model's action decisions from low-level motion by using a perfect executor, which distinguishes cognitive failure (not knowing what to do) from execution failure (being unable to do it).
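
The distinction between cognitive and execution failure can be sketched as a simple attribution rule. This is a minimal illustration, not BeTTER's implementation: it assumes each episode is run twice, once with the model controlling the robot end to end and once with the model's decisions carried out by an idealized executor. The `EpisodePair` record, its field names, and the `attribute` and `summarize` helpers are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    COGNITIVE_FAILURE = "cognitive_failure"
    EXECUTION_FAILURE = "execution_failure"


@dataclass(frozen=True)
class EpisodePair:
    """Hypothetical record of one episode run twice (not BeTTER's log schema)."""
    raw_success: bool     # model controls the robot end to end
    oracle_success: bool  # model's decisions carried out by an idealized executor


def attribute(pair: EpisodePair) -> Outcome:
    """Assign a failure type to one pair of rollouts."""
    if not pair.oracle_success:
        # Even idealized execution could not rescue the model's decisions.
        return Outcome.COGNITIVE_FAILURE
    if not pair.raw_success:
        # The decisions were workable, but the model's own control failed.
        return Outcome.EXECUTION_FAILURE
    return Outcome.SUCCESS


def summarize(pairs):
    """Count outcomes over a collection of episode pairs."""
    return Counter(attribute(p) for p in pairs)
```

Checking the idealized rollout first is a deliberate choice in this sketch: a failure is blamed on execution only after the decisions themselves have been shown to be workable.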

Section 05

Diagnostic Findings: Behavioral Inertia and Semantic Feature Collapse

BeTTER evaluations reveal two major flaws in state-of-the-art (SOTA) VLA models:

Behavioral inertia: Models over-rely on specific action sequences and cannot adapt flexibly, so they fail in scenarios that require generalization;
Semantic feature collapse: Models recognize the visual features of objects but fail to map them to functional attributes (e.g., identifying a cup without understanding its use as a container).
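
One way to make behavioral inertia measurable is to compare what a model does before and after an intervention that should change its plan. The sketch below is an assumption-laden illustration rather than BeTTER's metric: it treats each rollout as a list of discrete action tokens, scores similarity with the standard library's `difflib.SequenceMatcher`, and uses a hypothetical threshold of 0.9 to flag a pair as "stuck". The function names and action labels are invented for the example.

```python
from difflib import SequenceMatcher


def inertia_score(base_actions, intervened_actions):
    """Similarity in [0, 1] between two action-token sequences."""
    return SequenceMatcher(None, base_actions, intervened_actions).ratio()


def inertia_rate(pairs, threshold=0.9):
    """Fraction of (base, intervened) pairs whose behavior barely changed.

    Assumes every intervention in `pairs` requires a different plan, so a
    near-identical action sequence indicates inertia rather than a correct
    decision to keep the same behavior.
    """
    if not pairs:
        return 0.0
    stuck = sum(1 for base, changed in pairs
                if inertia_score(base, changed) >= threshold)
    return stuck / len(pairs)


# Hypothetical rollouts: the first pair repeats the same actions after the
# intervention, the second pair switches to a different strategy.
example_pairs = [
    (["reach", "grasp", "lift"], ["reach", "grasp", "lift"]),
    (["reach", "grasp"], ["push", "slide"]),
]
example_rate = inertia_rate(example_pairs)
```

A high rate under these assumptions would suggest the model is replaying a memorized sequence instead of responding to the changed scene.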

Section 06

BeTTER Benchmark Suite: Multi-Dimensional Evaluation System

BeTTER includes 10 basic manipulation tasks and 60 diagnostic variants that manipulate object properties, spatial configurations, and other factors, forming a multi-dimensional evaluation grid. It also provides data augmentation and privileged logging tools, integrates with MimicGen to generate training data, and supports analysis of internal model representations.
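
An evaluation grid of this kind can be summarized as a success rate per cell, where a cell pairs a task with the dimension being manipulated. The following is a rough sketch under assumptions: the record layout `(task, dimension, success)`, the function name `success_grid`, and the task and dimension labels are hypothetical and do not reflect BeTTER's actual logging format or task list.

```python
from collections import defaultdict


def success_grid(records):
    """Aggregate (task, dimension, success) records into per-cell success rates."""
    counts = defaultdict(lambda: [0, 0])  # (task, dimension) -> [successes, trials]
    for task, dimension, success in records:
        cell = counts[(task, dimension)]
        cell[0] += int(success)
        cell[1] += 1
    return {key: wins / trials for key, (wins, trials) in counts.items()}


# Hypothetical results for one task under its base setup and one variant.
example_records = [
    ("stack", "base", True),
    ("stack", "base", True),
    ("stack", "object_property", False),
    ("stack", "object_property", True),
]
example_grid = success_grid(example_records)
```

Reading across a row of such a grid shows how far performance on a task drops once a single factor is changed, which is the comparison the diagnostic variants are meant to enable.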

Section 07

Open-Source Roadmap and Significance of Community Contributions

BeTTER adopts a progressive open-source strategy: papers and frameworks have already been released, and the task generation pipelines and more are planned to follow. The benchmark depends on tools such as Objaverse and MimicGen. Its authors call for an evaluation system that better reflects real cognitive capabilities, in order to help embodied intelligence technology mature.
