WBench: A Comprehensive Multi-Round Benchmark for Evaluating Interactive Video World Models

The Meituan team has released WBench, a benchmark of 289 test cases and 1,058 interaction rounds that evaluates interactive video world models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physical compliance.

Tags: World models · Video generation · Benchmarks · Multimodal evaluation · Interactive AI · Meituan
Published 2026-05-25 22:01 · Recent activity 2026-05-26 13:48 · Estimated read 5 min

Section 01

WBench: Introduction to the Comprehensive Multi-Round Benchmark for Evaluating Interactive Video World Models

The Meituan team has released WBench, a benchmark for evaluating interactive video world models. It comprises 289 test cases and 1,058 interaction rounds and scores models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physical compliance. The code and data are open source (GitHub: https://github.com/meituan-longcat/WBench), giving academia and industry a shared evaluation standard.

Section 02

Background: Three Major Challenges in Existing Interactive World Model Evaluation

Interactive world models have broad application prospects in fields such as games and film and television, but existing evaluations fall short in three ways:

  1. Evaluation dimensions are fragmented, with no unified framework;
  2. Multi-round interaction tests are missing, so real usage scenarios are hard to simulate;
  3. Control methods are not unified, which makes fair comparison between models difficult.

Section 03

WBench Core Design: Five Key Evaluation Dimensions

WBench evaluates models from five dimensions:

  1. Video Quality: Clarity, coherence, realism;
  2. Setting Adherence: Faithfully following the specified settings, such as scenes, styles, and subjects;
  3. Interaction Adherence: Executing instructions and remembering the interaction history across multi-round interactions;
  4. Consistency: Stability of subjects, scenes, and time across rounds;
  5. Physical Compliance: Conforming to physical laws such as gravity and collision.

Section 04

WBench Test Dataset and Interaction Types

The dataset contains 289 test cases and 1,058 interaction rounds. It varies scenes (indoor, outdoor, etc.), styles (realistic, cartoon, etc.), subjects (humans, animals, etc.), and perspectives (first and third person). Interactions fall into four categories: navigation, subject action, event editing, and perspective switching. For the navigation task, WBench unifies three control methods (text control, 6-degree-of-freedom pose, and discrete actions) so that models can be compared fairly.
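
To make the structure described above concrete, here is a minimal sketch of how a multi-round test case might be represented. It is not the WBench data schema: every class, field, and enum name below is hypothetical and chosen only for illustration, and attaching a control method solely to navigation rounds is an assumption drawn from the description above. The only figures taken from the post are the 289 cases and 1,058 rounds.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class InteractionType(Enum):
    # The four interaction categories named in the post.
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class ControlMethod(Enum):
    # The three navigation control methods named in the post.
    TEXT = "text"
    POSE_6DOF = "pose_6dof"
    DISCRETE_ACTION = "discrete_action"


@dataclass
class Round:
    """One interaction round (hypothetical layout, not the real schema)."""
    interaction: InteractionType
    instruction: str
    # Assumption: only navigation rounds specify a control method.
    control: Optional[ControlMethod] = None


@dataclass
class BenchCase:
    """One multi-round test case (hypothetical layout, not the real schema)."""
    case_id: str
    scene: str        # e.g. "indoor" or "outdoor"
    style: str        # e.g. "realistic" or "cartoon"
    subject: str      # e.g. "human" or "animal"
    perspective: str  # "first_person" or "third_person"
    rounds: List[Round] = field(default_factory=list)


def mean_rounds_per_case(total_rounds: int, total_cases: int) -> float:
    """Average number of interaction rounds per test case."""
    return total_rounds / total_cases


example = BenchCase(
    case_id="demo-001",  # invented identifier
    scene="indoor",
    style="realistic",
    subject="human",
    perspective="first_person",
    rounds=[
        Round(InteractionType.NAVIGATION, "walk to the window", ControlMethod.TEXT),
        Round(InteractionType.EVENT_EDITING, "it starts to rain outside"),
    ],
)
```

With the figures from the post, `mean_rounds_per_case(1058, 289)` comes to roughly 3.7 rounds per case, which gives a sense of how long a typical interaction sequence is.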

Section 05

WBench Evaluation Method: 22 Automatic Sub-indicators

WBench scores models with 22 automatic sub-indicators:

  • Computer vision models assess video quality, object detection, etc.;
  • Large multimodal models judge semantic understanding and consistency;
  • All indicators are validated against manual annotation to ensure they agree with human judgment.
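
The post does not say which statistic is used to check agreement between an automatic indicator and human annotation. As one plausible approach, and purely as an assumption, the sketch below computes the Spearman rank correlation between automatic scores and human ratings using only the standard library. The function names are hypothetical and are not part of the WBench code.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(auto_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(auto_scores) != len(human_ratings) or len(auto_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(auto_scores), average_ranks(human_ratings))
```

A value near 1 would mean the automatic indicator orders videos the same way human annotators do; a value near 0 would suggest the indicator does not track human judgment.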

Section 06

Key Findings: No All-Round Model, Each Model Has Its Strengths and Weaknesses

Tests on 20 advanced models show that no single model excels in every dimension. The models differ as follows:

  • Some have excellent video quality but poor physical compliance;
  • Some are good at setting adherence but lack multi-round consistency;
  • Some excel at specific interaction types but are only average at others.

These results indicate that the field still has considerable room for improvement.
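
The "no all-round model" observation can be checked mechanically once per-dimension scores are available. The sketch below is illustrative only: the model names and numbers are invented placeholders, not WBench results, and the helper names and the 0-to-1 score scale are assumptions. It finds the top model in each of the five dimensions and lists any model that is best, or tied for best, in all of them.

```python
from typing import Dict, List

# The five dimensions named in the post (identifier spellings are hypothetical).
DIMENSIONS = [
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physical_compliance",
]

# Invented placeholder scores for illustration only; NOT benchmark results.
PLACEHOLDER_SCORES: Dict[str, Dict[str, float]] = {
    "model_a": dict(zip(DIMENSIONS, [0.9, 0.6, 0.6, 0.5, 0.4])),
    "model_b": dict(zip(DIMENSIONS, [0.6, 0.9, 0.5, 0.4, 0.6])),
    "model_c": dict(zip(DIMENSIONS, [0.5, 0.5, 0.8, 0.7, 0.5])),
}


def best_per_dimension(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each dimension to the model with the highest score in it."""
    return {d: max(scores, key=lambda m: scores[m][d]) for d in DIMENSIONS}


def all_round_models(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Models that are best, or tied for best, in every dimension."""
    top = {d: max(s[d] for s in scores.values()) for d in DIMENSIONS}
    return [m for m, s in scores.items() if all(s[d] >= top[d] for d in DIMENSIONS)]


leaders = best_per_dimension(PLACEHOLDER_SCORES)
winners = all_round_models(PLACEHOLDER_SCORES)
```

With these placeholder numbers, different models lead different dimensions and `winners` is empty, which is the same pattern the post describes for the 20 models tested.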

Section 07

Open Source and Significance: Promoting the Development of Interactive World Model Technology

The WBench code and data are open source (https://github.com/meituan-longcat/WBench) and provide a unified evaluation standard. The benchmark helps researchers understand the strengths and weaknesses of current models and gives developers concrete optimization targets, which should in turn support more reliable interactive video tools.