WBench: A Comprehensive Multi-Round Benchmark for Evaluating Interactive Video World Models

The Meituan team has released WBench, a benchmark that covers 289 test cases and 1058 interaction rounds and evaluates interactive video world models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physical compliance.

World models, video generation, benchmark, multimodal evaluation, interactive AI, Meituan

Published 2026-05-25 22:01 | Recent activity 2026-05-26 13:48 | Estimated read 5 min

Section 01

WBench: Introduction to the Comprehensive Multi-Round Benchmark for Evaluating Interactive Video World Models

The Meituan team has released WBench, a benchmark for evaluating interactive video world models. It covers 289 test cases and 1058 interaction rounds and assesses models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physical compliance. The code and data are open source (https://github.com/meituan-longcat/WBench), giving academia and industry a shared evaluation standard.

Section 02

Background: Three Major Challenges in Existing Interactive World Model Evaluation

Interactive world models have promising applications in fields such as games and film and television, but existing evaluations fall short in three ways:

Evaluation dimensions are fragmented, with no unified framework;
Multi-round interaction tests are missing, so real usage scenarios are hard to simulate;
Control methods are not unified, so models are hard to compare fairly.

Section 03

WBench Core Design: Five Key Evaluation Dimensions

WBench evaluates models along five dimensions:

Video Quality: Clarity, coherence, and realism of the generated video;
Setting Adherence: Faithfully following the specified settings, such as scene, style, and subject;
Interaction Adherence: Executing instructions and remembering earlier rounds during multi-round interactions;
Consistency: Keeping subjects, scenes, and temporal continuity stable across rounds;
Physical Compliance: Conforming to physical laws such as gravity and collision.

Section 04

WBench Test Dataset and Interaction Types

The dataset contains 289 test cases and 1058 interaction rounds, an average of about 3.7 rounds per test case. It varies scenes (indoor, outdoor, etc.), styles (realistic, cartoon, etc.), subjects (humans, animals, etc.), and perspectives (first person and third person). Interactions fall into four categories: navigation, subject action, event editing, and perspective switching. For the navigation task, WBench unifies three control methods (text control, 6-degree-of-freedom (6-DoF) pose, and discrete actions) so that models driven by different inputs can be compared fairly.
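To make the idea of unified navigation control concrete, here is a minimal sketch of how one navigation intent could be expressed in all three formats at once. This is not WBench's actual data format: the action names, step sizes, axis conventions, and the `PoseDelta` and `render_navigation` names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoseDelta:
    """Relative 6-DoF camera motion (hypothetical units: metres and degrees)."""
    dx: float = 0.0     # right (+) / left (-)
    dy: float = 0.0     # up (+) / down (-)
    dz: float = 0.0     # forward (+) / backward (-)
    roll: float = 0.0
    pitch: float = 0.0
    yaw: float = 0.0    # turn right (+) / turn left (-)


# Assumed step sizes; a real benchmark would define its own.
STEP_M = 0.5
TURN_DEG = 15.0

# Hypothetical discrete action vocabulary mapped to pose deltas.
ACTION_TO_POSE = {
    "forward": PoseDelta(dz=STEP_M),
    "backward": PoseDelta(dz=-STEP_M),
    "left": PoseDelta(dx=-STEP_M),
    "right": PoseDelta(dx=STEP_M),
    "turn_left": PoseDelta(yaw=-TURN_DEG),
    "turn_right": PoseDelta(yaw=TURN_DEG),
}

# The same vocabulary expressed as text instructions.
ACTION_TO_TEXT = {
    "forward": "Move the camera forward.",
    "backward": "Move the camera backward.",
    "left": "Move the camera to the left.",
    "right": "Move the camera to the right.",
    "turn_left": "Turn the camera left.",
    "turn_right": "Turn the camera right.",
}


def render_navigation(action: str) -> dict:
    """Express one navigation intent as a discrete action, a pose, and text."""
    if action not in ACTION_TO_POSE:
        raise ValueError(f"unknown action: {action!r}")
    return {
        "discrete": action,
        "pose": ACTION_TO_POSE[action],
        "text": ACTION_TO_TEXT[action],
    }


if __name__ == "__main__":
    for name in ("forward", "turn_left"):
        print(render_navigation(name))
```

Under this kind of scheme, a model that accepts only text and a model that accepts only camera poses would receive the same underlying instruction, which is what makes their results comparable.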

Section 05

WBench Evaluation Method: 22 Automatic Sub-indicators

WBench evaluates models with 22 automatic sub-indicators:

Computer vision models assess video quality and perform tasks such as object detection;
Large multimodal models judge semantic understanding and consistency;
All indicators are validated against manual annotations to confirm that they agree with human judgment.
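The sketch below illustrates two steps that a pipeline of this kind typically needs: rolling sub-indicator scores up into dimension scores, and checking how well an automatic indicator agrees with human ratings. Both choices are assumptions, not WBench's documented method: the unweighted mean is only one possible aggregation, Spearman rank correlation is only one common agreement measure, and all function and key names are hypothetical.

```python
from statistics import mean


def dimension_scores(sub_scores, groups):
    """Average sub-indicator scores within each dimension (unweighted, by assumption)."""
    return {
        dim: mean(sub_scores[name] for name in names)
        for dim, names in groups.items()
    }


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

In use, `spearman(metric_scores, human_ratings)` would be computed over the same set of videos; a value near 1 means the automatic indicator ranks videos in nearly the same order as human annotators.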

Section 06

Key Findings: No All-Round Model, Each Model Has Its Strengths and Weaknesses

Tests of 20 advanced models show that no single model excels in every dimension. The models differ in characteristic ways:

Some models have excellent video quality but poor physical compliance;
Some are good at setting adherence but lack multi-round consistency;
Some excel in specific interaction types but are average in others.

These gaps show that the field still has considerable room for improvement.
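A finding of this kind can be checked mechanically from a table of per-dimension scores. The sketch below assumes a hypothetical layout (model name mapped to one score per dimension) and uses invented placeholder numbers purely to exercise the code; they are not WBench results.

```python
DIMENSIONS = (
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physical_compliance",
)


def leaders(scores):
    """Return the best-scoring model for each dimension (first one wins ties)."""
    return {d: max(scores, key=lambda m: scores[m][d]) for d in DIMENSIONS}


def all_round_models(scores):
    """Models that lead every dimension; empty when strengths are split."""
    per_dim = leaders(scores)
    return sorted(m for m in scores if all(per_dim[d] == m for d in DIMENSIONS))


# Placeholder scores for three made-up models (NOT benchmark results).
example = {
    "model_a": {"video_quality": 0.9, "setting_adherence": 0.6,
                "interaction_adherence": 0.5, "consistency": 0.6,
                "physical_compliance": 0.4},
    "model_b": {"video_quality": 0.7, "setting_adherence": 0.8,
                "interaction_adherence": 0.6, "consistency": 0.5,
                "physical_compliance": 0.5},
    "model_c": {"video_quality": 0.6, "setting_adherence": 0.7,
                "interaction_adherence": 0.7, "consistency": 0.7,
                "physical_compliance": 0.6},
}

if __name__ == "__main__":
    print(leaders(example))
    print(all_round_models(example))
```

With scores split like this, `all_round_models` returns an empty list, which is the pattern the article describes: each model leads somewhere, and none leads everywhere.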

Section 07

Open Source and Significance: Promoting the Development of Interactive World Model Technology

The WBench code and data are open source (https://github.com/meituan-longcat/WBench), providing a unified evaluation standard. For researchers, the benchmark makes the strengths and weaknesses of each model visible across all five dimensions; for developers, it supplies concrete optimization targets. Over time, this should help accelerate progress toward more reliable interactive video tools.
