Zing Forum

BeTTER Benchmark: Debunking the Illusion of Embodied Reasoning Capabilities in VLA Models

Using causal intervention and kinematic isolation, BeTTER separates high-level reasoning failures from low-level execution constraints, which the authors present as a first, and reveals severe cognitive deficits in the semantic understanding and sequence planning of current VLA models.

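Tags: VLA models, embodied intelligence, benchmarking, causal intervention, robot reasoning, vision-language models, behavioral inertia, semantic understanding
Published 2026-04-21 14:11 | Recent activity 2026-04-21 14:20 | Estimated read 5 min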

Section 01

BeTTER Benchmark: Debunking the Illusion of Embodied Reasoning Capabilities in VLA Models [Introduction]

The BeTTER benchmark uses causal intervention and kinematic isolation to separate high-level reasoning failures from low-level execution constraints, which the authors present as a first. It reveals severe cognitive deficits in the semantic understanding and sequence planning of current VLA models. The posts below cover the background, methodology, and diagnostic findings in turn.

Section 02

Background: The Successes and Hidden Weaknesses of VLA Models

In recent years, Vision-Language-Action (VLA) models have achieved impressive success rates on robot manipulation benchmarks, appearing to show strong semantic understanding and sequence planning. However, teams from Peking University, Tsinghua University, and BeingBeyond question whether these successes mask deep cognitive deficits, and they have introduced the BeTTER benchmark to expose the "illusion" behind such capabilities.

Section 03

The Nature of the Illusion: Execution Success ≠ Correct Reasoning

Current evaluations conflate task completion with correct reasoning. A model may complete a task through behavioral inertia (repeating high-frequency actions from training) rather than semantic understanding, or it may recognize objects while misunderstanding their functional or spatial relationships. BeTTER calls this the "embodied reasoning illusion": traditional metrics look only at outcomes and ignore the cognitive process that produced them.

Section 04

BeTTER Methodology: Causal Intervention and Kinematic Isolation

The core innovations of BeTTER are causal intervention and kinematic isolation:

  • Causal intervention: Modify environmental variables (e.g., change an object's physical properties while keeping its appearance unchanged) and observe whether the model's behavior responds to semantically relevant changes;
  • Kinematic isolation: Pair the model's action outputs with an idealized ("perfect") executor so that execution limits are factored out, which distinguishes cognitive failure (not knowing what to do) from execution failure (being unable to do it).
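
The failure attribution behind kinematic isolation can be sketched roughly as follows. This is a minimal illustration and not BeTTER's actual code: the `Episode` fields and the `attribute_failure` helper are hypothetical names, and it assumes each episode can be replayed once with the model's own low-level control and once with an idealized executor.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    NONE = "none"
    COGNITIVE = "cognitive"  # the model does not know what to do
    EXECUTION = "execution"  # the model knows what to do but cannot carry it out


@dataclass
class Episode:
    # Hypothetical fields, assumed for illustration only.
    success_own_control: bool     # outcome with the model's own low-level actions
    success_ideal_executor: bool  # outcome when an idealized executor performs the motion


def attribute_failure(ep: Episode) -> Failure:
    """Label an episode by where it broke down."""
    if ep.success_own_control:
        return Failure.NONE
    if ep.success_ideal_executor:
        # Removing execution limits rescued the episode, so the decision was sound.
        return Failure.EXECUTION
    # Even a perfect executor could not rescue it, so the decision itself was wrong.
    return Failure.COGNITIVE


if __name__ == "__main__":
    print(attribute_failure(Episode(False, True)))   # Failure.EXECUTION
    print(attribute_failure(Episode(False, False)))  # Failure.COGNITIVE
```

Note that a success under the model's own control says nothing about whether the reasoning was correct, which is why the benchmark pairs this kind of isolation with causal interventions.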

Section 05

Diagnostic Findings: Behavioral Inertia and Semantic Feature Collapse

BeTTER evaluations reveal two major flaws in state-of-the-art (SOTA) VLA models:

  1. Behavioral inertia: Models over-rely on specific action sequences and cannot adapt flexibly, so they fail when the scenario departs from what they saw in training;
  2. Semantic feature collapse: Models recognize the visual features of objects but fail to map them to functional attributes (e.g., recognizing a cup without understanding that it serves as a container).
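
One way to put a number on behavioral inertia is to count how often the model repeats its pre-intervention action even though the intervention changed what the correct action is. The sketch below is an assumption made for illustration, not a metric taken from the paper; `inertia_rate` and the action labels are hypothetical.

```python
from typing import Sequence


def inertia_rate(baseline_actions: Sequence[str],
                 intervened_actions: Sequence[str],
                 correct_after: Sequence[str]) -> float:
    """Fraction of relevant trials where the model repeats its baseline action.

    A trial is relevant only if the intervention changed the correct action,
    i.e. the correct action after the intervention differs from the baseline one.
    """
    if not (len(baseline_actions) == len(intervened_actions) == len(correct_after)):
        raise ValueError("all three sequences must have the same length")

    relevant = 0
    repeated = 0
    for base, new, target in zip(baseline_actions, intervened_actions, correct_after):
        if target == base:
            continue  # the intervention did not change the right answer
        relevant += 1
        if new == base:
            repeated += 1  # the model ignored the intervention
    return repeated / relevant if relevant else 0.0


if __name__ == "__main__":
    rate = inertia_rate(
        baseline_actions=["pick_red", "pick_red", "open_drawer"],
        intervened_actions=["pick_red", "pick_blue", "open_drawer"],
        correct_after=["pick_blue", "pick_blue", "open_drawer"],
    )
    print(rate)  # 0.5
```

A rate near 1.0 would suggest the model is replaying a memorized sequence, while a rate near 0.0 would suggest its behavior tracks the semantically relevant change.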

Section 06

BeTTER Benchmark Suite: Multi-Dimensional Evaluation System

BeTTER comprises 10 basic manipulation tasks and 60 diagnostic variants. The variants manipulate object properties, spatial configurations, and other factors, forming a multi-dimensional evaluation grid. The suite also provides data augmentation and privileged logging tools, integrates with MimicGen to generate training data, and supports analysis of internal model representations.
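
To make the idea of an evaluation grid concrete, the sketch below aggregates per-episode outcomes into a success rate for each (task, intervention dimension) cell. The task and dimension names are invented for illustration, and the record layout is an assumption rather than BeTTER's logging format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (task name, intervention dimension, success flag).
Result = Tuple[str, str, bool]


def success_grid(results: Iterable[Result]) -> Dict[Tuple[str, str], float]:
    """Return the success rate for every (task, dimension) cell that has data."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, trials]
    for task, dimension, ok in results:
        cell = counts[(task, dimension)]
        cell[0] += int(ok)
        cell[1] += 1
    return {key: successes / trials for key, (successes, trials) in counts.items()}


if __name__ == "__main__":
    episodes = [
        ("stack_blocks", "base", True),
        ("stack_blocks", "base", True),
        ("stack_blocks", "object_property", False),
        ("stack_blocks", "object_property", True),
    ]
    for cell, rate in success_grid(episodes).items():
        print(cell, rate)
```

Comparing a task's base cell with its variant cells shows which kind of intervention degrades performance, which a single overall success rate would hide.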

Section 07

Open-Source Roadmap and Significance of Community Contributions

BeTTER is being open-sourced in stages: the paper and framework have already been released, and the task generation pipeline and further components are planned for later. The project builds on tools such as Objaverse and MimicGen. The authors call for evaluation systems that better reflect real cognitive capabilities, to help embodied intelligence technology mature.