# BeTTER Benchmark: Debunking the Illusion of Embodied Reasoning Capabilities in VLA Models

> BeTTER is presented as the first benchmark to decouple high-level reasoning failures from low-level execution constraints. It does so through causal intervention and kinematic isolation, and it reveals severe cognitive deficits in semantic understanding and sequence planning in current VLA models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-21T06:11:24.000Z
- Last activity: 2026-04-21T06:20:49.021Z
- Popularity: 150.8
- Keywords: VLA models, embodied intelligence, benchmarking, causal intervention, robot reasoning, vision-language models, behavioral inertia, semantic understanding
- Page link: https://www.zingnex.cn/en/forum/thread/better-vla
- Canonical: https://www.zingnex.cn/forum/thread/better-vla
- Markdown source: floors_fallback

---

## BeTTER Benchmark: Debunking the Illusion of Embodied Reasoning Capabilities in VLA Models [Introduction]

The BeTTER benchmark uses causal intervention and kinematic isolation to separate high-level reasoning failures from low-level execution constraints, and is described as the first benchmark to do so. Its evaluations reveal severe cognitive deficits in semantic understanding and sequence planning in current VLA models. The posts below cover the background, the methodology, the diagnostic findings, the benchmark suite, and the open-source roadmap.

## Background: The Glories and Hidden Concerns of VLA Models

In recent years, Vision-Language-Action (VLA) models have achieved impressive success rates on robot manipulation benchmarks, which suggests strong semantic understanding and sequence planning capabilities. Teams from Peking University, Tsinghua University, and BeingBeyond question whether these successes mask deep cognitive deficits, and have released the BeTTER benchmark to test whether the apparent capabilities are an "illusion".

## The Nature of the Illusion: Execution Success ≠ Correct Reasoning

Current evaluations conflate task completion with correct reasoning. A model may complete a task through behavioral inertia (repeating high-frequency actions from training) rather than semantic understanding, or it may recognize objects while misunderstanding their functional or spatial relationships. BeTTER calls this the "embodied reasoning illusion": traditional metrics look only at outcomes and ignore the cognitive process that produced them.

## BeTTER Methodology: Causal Intervention and Kinematic Isolation

The core innovations of BeTTER are causal intervention and kinematic isolation:
- Causal intervention: modify environmental variables (e.g., change an object's physical properties while keeping its appearance unchanged) and observe whether the model's behavior responds to semantically relevant changes;
- Kinematic isolation: separate the model's decisions from low-level execution by using a perfect executor, which distinguishes cognitive failure (not knowing what to do) from execution failure (being unable to do it).
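The distinction between cognitive and execution failure can be sketched as a small attribution rule. This is a minimal illustration, not BeTTER's implementation: the `Episode` record, its field names, and the `attribute` function are hypothetical and assume that each episode log can say whether the policy committed to the right sub-goal and whether the motion actually achieved it.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    COGNITIVE_FAILURE = "cognitive_failure"   # the policy did not know what to do
    EXECUTION_FAILURE = "execution_failure"   # the policy knew, but could not do it


@dataclass(frozen=True)
class Episode:
    # Hypothetical record; these fields are assumptions, not BeTTER's log schema.
    chose_correct_subgoal: bool  # did the policy commit to the right target or step?
    reached_subgoal: bool        # did the low-level motion achieve that target?


def attribute(ep: Episode) -> Outcome:
    """Attribute an episode outcome to reasoning or to execution."""
    # A wrong decision is a cognitive failure even if the motion itself went well.
    if not ep.chose_correct_subgoal:
        return Outcome.COGNITIVE_FAILURE
    if not ep.reached_subgoal:
        return Outcome.EXECUTION_FAILURE
    return Outcome.SUCCESS
```

Checking the decision first means a well-executed motion toward the wrong target is never counted as a success.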

## Diagnostic Findings: Behavioral Inertia and Semantic Feature Collapse

BeTTER evaluations reveal two major flaws in state-of-the-art (SOTA) VLA models:
1. Behavioral inertia: the models over-rely on specific action sequences, so they cannot adapt when a scenario departs from the training distribution and fail to generalize;
2. Semantic feature collapse: the models recognize the visual features of objects but fail to map them to functional attributes (e.g., recognizing a cup without understanding that it serves as a container).
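One way to make behavioral inertia measurable is to count how often a policy repeats its baseline choice after an intervention has changed the correct answer. The sketch below is an assumption about how such a rate could be computed, not a metric defined by BeTTER; `inertia_rate` and the triple layout are hypothetical.

```python
from typing import Sequence, Tuple

# Each triple is (baseline_choice, intervened_choice, correct_choice_after_intervention).
Trial = Tuple[str, str, str]


def inertia_rate(trials: Sequence[Trial]) -> float:
    """Fraction of relevant trials where the policy kept its baseline choice."""
    # Only count trials where the intervention actually changes the correct choice.
    relevant = [t for t in trials if t[2] != t[0]]
    if not relevant:
        raise ValueError("no trial changes the correct choice")
    stuck = sum(1 for baseline, intervened, _ in relevant if intervened == baseline)
    return stuck / len(relevant)
```

A rate near 1.0 would suggest the policy ignores the intervention; a rate near 0.0 only shows that it changes its choice, not that the new choice is correct.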

## BeTTER Benchmark Suite: Multi-Dimensional Evaluation System

BeTTER includes 10 basic manipulation tasks and 60 diagnostic variants. The variants manipulate object properties, spatial configurations, and other factors, forming a multi-dimensional evaluation grid. The suite also provides data augmentation and privileged logging tools, integrates with MimicGen to generate training data, and supports analysis of internal model representations.
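The evaluation grid described above can be pictured as success rates indexed by task and by the dimension a variant manipulates. The following is a minimal sketch under that assumption; `success_grid`, the tuple layout, and the example task and dimension names are hypothetical and do not reflect BeTTER's tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each result is (task_name, manipulated_dimension, success_flag).
Result = Tuple[str, str, bool]


def success_grid(results: Iterable[Result]) -> Dict[Tuple[str, str], float]:
    """Aggregate per-episode results into a success rate per (task, dimension) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, episodes]
    for task, dimension, ok in results:
        cell = counts[(task, dimension)]
        cell[0] += int(ok)
        cell[1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


# Usage with made-up episodes (illustrative values only, not benchmark results).
grid = success_grid([
    ("pick_cup", "object_property", True),
    ("pick_cup", "object_property", False),
    ("pick_cup", "spatial_layout", True),
])
```

Keeping the dimension in the key lets a reader see which kind of variation a model fails on, rather than a single aggregate success rate.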

## Open-Source Roadmap and Significance of Community Contributions

BeTTER follows a progressive open-source strategy: the paper and framework have already been released, and the task generation pipelines and further components are planned for future release. The project builds on tools such as Objaverse and MimicGen. Its authors call for evaluation systems that better reflect real cognitive capabilities, in order to help embodied intelligence technology mature.
