# WBench: A Comprehensive Multi-Round Benchmark for Evaluating Interactive Video World Models

> The Meituan team has released WBench, a benchmark of 289 test cases and 1058 interaction rounds that evaluates interactive video world models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physical compliance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-25T14:01:31.000Z
- Last activity: 2026-05-26T05:48:52.488Z
- Popularity: 131.2
- Keywords: world models, video generation, benchmarking, multimodal evaluation, interactive AI, Meituan
- Page link: https://www.zingnex.cn/en/forum/thread/wbench
- Canonical: https://www.zingnex.cn/forum/thread/wbench
- Markdown source: floors_fallback

---

## WBench: Introduction to the Comprehensive Multi-Round Benchmark for Evaluating Interactive Video World Models

The Meituan team has released WBench, a benchmark for comprehensively evaluating interactive video world models. It covers 289 test cases and 1058 interaction rounds and assesses models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physical compliance. The code and data are open source (GitHub: https://github.com/meituan-longcat/WBench), giving academia and industry a unified evaluation standard.

## Background: Three Major Challenges in Existing Interactive World Model Evaluation

Interactive world models have broad application prospects in fields such as games and film and television, but existing evaluations fall short in three ways:
1. Evaluation dimensions are fragmented, with no unified framework;
2. Multi-round interaction tests are missing, so real usage scenarios are hard to simulate;
3. Control methods are not unified, which makes fair comparison between models difficult.

## WBench Core Design: Five Key Evaluation Dimensions

WBench evaluates models along five dimensions:
1. **Video Quality**: Clarity, coherence, and realism;
2. **Setting Adherence**: Accurately following the specified settings, such as scenes, styles, and subjects;
3. **Interaction Adherence**: Executing instructions and remembering the interaction history across multiple rounds;
4. **Consistency**: Stability of subjects, scenes, and time across rounds;
5. **Physical Compliance**: Conforming to physical laws such as gravity and collision.

## WBench Test Dataset and Interaction Types

The dataset contains 289 test cases and 1058 interaction rounds, spanning diverse scenes (indoor, outdoor, etc.), styles (realistic, cartoon, etc.), subjects (humans, animals, etc.), and perspectives (first and third person).
Interaction types fall into four categories: navigation, subject action, event editing, and perspective switching.
The navigation task unifies three control methods (text control, 6-degree-of-freedom (6-DoF) pose, and discrete actions) so that models can be compared fairly.
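
The structure described above can be sketched as a small data model. This is a minimal illustration, not WBench's actual schema: the class names, field names, and the rule that only navigation rounds carry a control method are assumptions made for the example. Only the four interaction types, the three control methods, and the totals (289 cases, 1058 rounds) come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class InteractionType(Enum):
    # The four interaction categories named in the text.
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class ControlMethod(Enum):
    # The three navigation control methods named in the text.
    TEXT = "text"
    POSE_6DOF = "pose_6dof"
    DISCRETE_ACTION = "discrete_action"


@dataclass
class InteractionRound:
    # Hypothetical layout of one interaction round.
    interaction_type: InteractionType
    instruction: str
    control: Optional[ControlMethod] = None

    def __post_init__(self):
        # Assumption: navigation rounds must say how they are controlled.
        if self.interaction_type is InteractionType.NAVIGATION and self.control is None:
            raise ValueError("navigation rounds need a control method")


@dataclass
class BenchmarkCase:
    # Hypothetical layout of one multi-round test case.
    case_id: str
    setting: str
    rounds: List[InteractionRound] = field(default_factory=list)


def average_rounds_per_case(total_rounds: int, total_cases: int) -> float:
    """Mean number of interaction rounds per test case."""
    return total_rounds / total_cases


example = BenchmarkCase(
    case_id="demo-001",  # made-up identifier
    setting="indoor, realistic, first person",
    rounds=[
        InteractionRound(InteractionType.NAVIGATION, "walk forward", ControlMethod.TEXT),
        InteractionRound(InteractionType.EVENT_EDITING, "turn off the lights"),
    ],
)
print(f"{average_rounds_per_case(1058, 289):.2f}")  # 3.66
```

With the reported totals, each test case averages about 3.66 interaction rounds, which is what makes the benchmark multi-round rather than single-shot.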

## WBench Evaluation Method: 22 Automatic Sub-metrics

WBench uses 22 automatic sub-metrics for evaluation:
- Computer vision models assess video quality, object detection, and related properties;
- Large multimodal models judge semantic understanding and consistency;
- All metrics are validated against human annotations to ensure agreement with human judgment.
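
One common way to check that an automatic score agrees with human judgment is rank correlation. The source does not say which statistic WBench uses, so the sketch below assumes Spearman's rank correlation purely for illustration. The function names and sample numbers are made up.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries equal to the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


# Made-up example: an automatic score and human ratings for four videos.
auto_scores = [0.10, 0.40, 0.35, 0.80]
human_scores = [1, 3, 2, 5]
print(spearman(auto_scores, human_scores))
```

A value near 1 means the automatic metric orders videos the same way annotators do; a value near 0 would suggest the metric does not track human judgment.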

## Key Findings: No All-Round Model, Each Model Has Its Strengths and Weaknesses

Testing 20 advanced models showed that no single model performs well across all dimensions. Typical patterns include:
- Some models have excellent video quality but poor physical compliance;
- Some are good at setting adherence but lack multi-round consistency;
- Some excel at specific interaction types but are average at others.

These results indicate that the field still has room for improvement.
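
The "no all-round model" observation can be expressed as a simple check over per-dimension scores. The sketch below is illustrative only: the model names and numbers are placeholders, not WBench results, and the dictionary layout is an assumption.

```python
DIMENSIONS = (
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physical_compliance",
)


def leaders_by_dimension(scores):
    """Map each dimension to the model with the highest score on it."""
    return {dim: max(scores, key=lambda model: scores[model][dim]) for dim in DIMENSIONS}


def find_all_round_model(scores):
    """Return the model that leads every dimension, or None if there is none."""
    leaders = set(leaders_by_dimension(scores).values())
    return leaders.pop() if len(leaders) == 1 else None


# Placeholder scores for two hypothetical models (not real benchmark data).
placeholder_scores = {
    "model_a": {
        "video_quality": 0.9,
        "setting_adherence": 0.7,
        "interaction_adherence": 0.6,
        "consistency": 0.6,
        "physical_compliance": 0.4,
    },
    "model_b": {
        "video_quality": 0.6,
        "setting_adherence": 0.8,
        "interaction_adherence": 0.7,
        "consistency": 0.5,
        "physical_compliance": 0.7,
    },
}

print(leaders_by_dimension(placeholder_scores))
print(find_all_round_model(placeholder_scores))  # None
```

When different models lead different dimensions, as in this placeholder data, the second function returns `None`, which is the pattern the findings describe.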

## Open Source and Significance: Promoting the Development of Interactive World Model Technology

The WBench code and data have been open-sourced (https://github.com/meituan-longcat/WBench), giving academia and industry a unified evaluation standard. The benchmark helps researchers identify the strengths and weaknesses of models and gives developers concrete optimization targets, which should accelerate technical progress and lead to more reliable interactive video tools.
