Zing Forum

OmniVCHall: The First Open-Source Hallucination Evaluation Benchmark for Video Multimodal Large Models

OmniVCHall, accepted at ICML 2026, has been officially open-sourced. It is the first evaluation benchmark specifically targeting the compositional hallucination problem in video multimodal large models, providing an important tool for assessing the reliability of video understanding models.

Tags: OmniVCHall, Video Multimodal Models, Hallucination Detection, Compositional Hallucination, Video Understanding, ICML 2026, Evaluation Benchmark, Anti-Hallucination Decoding
Published 2026-05-14 10:10 · Recent activity 2026-05-14 10:21 · Estimated read: 8 min

Section 01

[Introduction] OmniVCHall: The First Open-Source Compositional Hallucination Evaluation Benchmark for Video Multimodal Large Models

OmniVCHall, accepted at ICML 2026, has been officially open-sourced. It is the first systematic evaluation benchmark specifically targeting the compositional hallucination problem in video multimodal large language models (MLLMs). The benchmark fills a critical gap in the reliability assessment of video understanding models and provides an important tool for deploying video MLLMs in safety-critical scenarios such as autonomous driving and medical diagnosis.


Section 02

Hallucination Challenges in Video Understanding and Definition of Compositional Hallucination

Challenges of Video Hallucination

The hallucination problems faced by video MLLMs are more complex than those of image models: they span spatiotemporal dimensions and cross-frame relationships, giving rise to compositional hallucination, where the model correctly identifies individual elements but combines their relationships incorrectly (e.g., attribute mismatch, action-subject mismatch, or temporal/spatial relationship mismatch).

Hierarchical Classification of Hallucinations

  • Basic Hallucination: Incorrect recognition of a single element (common in image MLLMs)
  • Compositional Hallucination: Incorrect understanding of relationships between elements (highly concealed and therefore more dangerous)
  • Inferential Hallucination: Incorrect reasoning based on the content

The core of compositional hallucination is a failure to understand relationships, making the output seem credible while containing fatal errors.
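The three levels above can be sketched as a small taxonomy. The class and field names here are illustrative, not identifiers from the paper's release:

```python
from enum import Enum

class HallucinationLevel(Enum):
    """Illustrative taxonomy of the three hallucination levels described above."""
    BASIC = "basic"                   # a single element is misrecognized
    COMPOSITIONAL = "compositional"   # elements are right, their relationship is wrong
    INFERENTIAL = "inferential"       # reasoning over the content is wrong

# Example: the caption "a man in red kicks a ball" for a video of a man in blue.
# Every element (man, ball, kick, a color) is present, but the attribute is
# bound to the wrong entity -- a compositional hallucination.
example = HallucinationLevel.COMPOSITIONAL
```

The point of the middle level is exactly the concealment the article describes: nothing in the output is individually false, only the binding between elements.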

Section 03

Design of OmniVCHall: Core Components for Systematic Evaluation of Compositional Hallucination

Multi-level Hallucination Classification System

Covers four types of compositional relationships: attribute-entity, action-subject, temporal, and spatial. This makes evaluation results interpretable, since each failure can be attributed to a specific hallucination type.

Adversarial Sample Construction

  1. Positive samples: Accurate descriptions of real videos
  2. Negative samples: Keep elements unchanged but perturb relationships (swap subjects, reverse temporal order, etc.)
  3. Hard samples: Construct options that are close to real but incorrect
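The perturbation idea behind the negative samples can be sketched on a structured caption. The event schema below (subject/action/object dicts in temporal order) is an assumption for illustration, not the benchmark's actual annotation format:

```python
import copy

# Events as (subject, action, object) dicts in temporal order; this structure
# is illustrative, not the benchmark's annotation schema.
positive = [
    {"subject": "man", "action": "opens", "object": "door"},
    {"subject": "dog", "action": "chases", "object": "ball"},
]

def swap_subjects(events, i, j):
    """Negative sample: keep every element but exchange two subjects."""
    out = copy.deepcopy(events)
    out[i]["subject"], out[j]["subject"] = out[j]["subject"], out[i]["subject"]
    return out

def reverse_temporal(events):
    """Negative sample: identical events, reversed temporal order."""
    return list(reversed(events))

neg_subject = swap_subjects(positive, 0, 1)   # now "the dog opens the door"
neg_temporal = reverse_temporal(positive)     # the chase now precedes the opening
```

Because the element inventory is unchanged, a model that only checks "are these things in the video?" cannot distinguish the negatives from the positive; only relationship understanding can.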

Multi-task Evaluation Protocol

Supports discriminative tasks (judging the correctness of descriptions), selection tasks (choosing correct descriptions), and generation tasks (evaluating hallucinations in generated content), adapting to the characteristics of different models.
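For the two choice-based formats, scoring reduces to accuracy over item-level answers. The function names and interfaces below are assumptions for illustration, not the benchmark's released evaluation API:

```python
def score_discriminative(judgments, labels):
    """Accuracy of true/false judgments on (video, description) pairs."""
    return sum(p == y for p, y in zip(judgments, labels)) / len(labels)

def score_selection(choices, answer_keys):
    """Accuracy of picking the single correct description per question."""
    return sum(c == k for c, k in zip(choices, answer_keys)) / len(answer_keys)

# The generation task is different in kind: it scores hallucinations found in
# free-form output, which requires matching generated claims against the
# video annotations rather than comparing to a fixed answer key.
```

The value of offering all three formats is coverage: discriminative and selection tasks suit models with constrained output, while the generation task probes hallucination in the open-ended setting where it actually occurs in deployment.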


Section 04

Key Findings: Severe Current State of Compositional Hallucination in Video MLLMs

  • Compositional hallucination is widespread: The accuracy of even the best models falls far below human performance
  • Scale is not a panacea: Simply increasing model size yields limited improvement on compositional hallucination
  • Temporal relationships are a weakness: Cross-frame reasoning is weak, with over-reliance on single-frame information
  • Attribute-entity binding is slightly better: Still shows detail errors in attributes such as color and quantity

These findings indicate that video MLLMs need to focus on optimizing relationship modeling rather than just pursuing scale.
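Findings like these fall out of breaking accuracy down by relationship category. A minimal sketch of that aggregation, with made-up item results over the four categories:

```python
from collections import defaultdict

def accuracy_by_category(results):
    """Aggregate item-level (category, correct) pairs into per-category accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for cat, correct in results:
        totals[cat] += 1
        hits[cat] += int(correct)
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Made-up item results; real data comes from running a model on the benchmark.
results = [
    ("temporal", False), ("temporal", False), ("temporal", True),
    ("attribute-entity", True), ("attribute-entity", True),
    ("spatial", True), ("action-subject", False),
]
report = accuracy_by_category(results)  # e.g. temporal accuracy lags the rest
```

A per-category report is what makes conclusions like "temporal relationships are a weakness" possible, as opposed to a single aggregate accuracy that hides where models fail.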

Section 05

Anti-Hallucination Decoding: Practical Strategies from Evaluation to Model Improvement

OmniVCHall proposes an anti-hallucination decoding method that can reduce hallucinations without retraining:

  • Compositional consistency check: Verify the consistency of relationships between tokens and video/generated content during decoding
  • Visual anchoring mechanism: Force generated content to anchor to video visual evidence
  • Backtracking correction strategy: Backtrack and adjust the generation path when a hallucination is detected

Experiments show that this strategy significantly reduces compositional hallucinations while maintaining fluency.
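The interplay of the consistency check and backtracking can be sketched as a decode loop. Everything here is a toy stand-in: `consistency` substitutes for the paper's compositional consistency check, and the threshold and candidate lists are assumptions:

```python
def decode_with_backtracking(candidates_per_step, evidence, consistency, threshold=0.5):
    """Greedy decode; if the top continuation is inconsistent with the visual
    evidence, back off to the next-best candidate (backtracking correction)."""
    output = []
    for candidates in candidates_per_step:
        # candidates: tokens sorted by model probability, best first
        for token in candidates:
            if consistency(output + [token], evidence) >= threshold:
                output.append(token)  # token is anchored to the evidence
                break
        else:
            output.append(candidates[0])  # nothing passes; keep the top token
    return output

# Toy consistency: a token is "anchored" if it appears in the visual evidence.
evidence = {"man", "opens", "door"}
cons = lambda partial, ev: 1.0 if partial[-1] in ev else 0.0
out = decode_with_backtracking(
    [["dog", "man"], ["opens"], ["window", "door"]], evidence, cons
)
# out == ["man", "opens", "door"]: "dog" and "window" are rejected at decode time
```

The appeal of this family of methods is exactly what the article notes: it intervenes at decoding, so no retraining is needed and the base model's fluency is preserved.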

Section 06

OmniVCHall Open-Source Ecosystem and Community Plan

Open-sourced content:

  • Evaluation dataset (multiple video types + hallucination categories)
  • Standardized evaluation code and metric calculation
  • Interfaces for mainstream video MLLMs (e.g., Video-LLaMA, VideoChat)
  • Hallucination analysis and visualization tool

Plans: Continuously maintain the benchmark, incorporate new models and methods, establish a public leaderboard, and promote community-driven research.

Section 07

Technical Insights and Future Directions for Reliability Research of Video MLLMs

Key Insights

  1. Evaluation-driven progress: OmniVCHall fills the gap in video hallucination evaluation
  2. Relationship understanding is core: Explicitly model element relationships
  3. Value of decoding strategies: Post-processing optimization is low-cost and yields quick gains
  4. Video specificity: Need to emphasize temporal relationships rather than just spatial features

Future Outlook

As video AI applications proliferate, reliability will become a core competitive advantage. OmniVCHall lays the foundation for this direction, and we look forward to more researchers pushing video understanding technology toward reliability and practicality.