# OmniVCHall: The First Open-Source Hallucination Evaluation Benchmark for Video Multimodal Large Models

> OmniVCHall, a paper accepted by ICML 2026, is officially open-sourced. It is the first evaluation benchmark specifically targeting the compositional hallucination problem of video multimodal large models, providing an important tool for reliability assessment of video understanding models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-14T02:10:24.000Z
- Last activity: 2026-05-14T02:21:07.024Z
- Popularity: 150.8
- Keywords: OmniVCHall, video multimodal models, hallucination detection, compositional hallucination, video understanding, ICML 2026, evaluation benchmark, anti-hallucination decoding
- Page link: https://www.zingnex.cn/en/forum/thread/omnivchall-a7958e78
- Canonical: https://www.zingnex.cn/forum/thread/omnivchall-a7958e78

---

## [Introduction] OmniVCHall: The First Open-Source Compositional Hallucination Evaluation Benchmark for Video Multimodal Large Models

OmniVCHall, a paper accepted by ICML 2026, is officially open-sourced. It is the first systematic evaluation benchmark specifically targeting the compositional hallucination problem of video multimodal large language models (MLLMs). This benchmark fills a critical gap in the reliability assessment of video understanding models and provides an important tool for the application of video MLLMs in key scenarios such as autonomous driving and medical diagnosis.

## Hallucination Challenges in Video Understanding and Definition of Compositional Hallucination

### Challenges of Video Hallucination
Hallucinations in video MLLMs are more complex than in image MLLMs: they span spatiotemporal dimensions and cross-frame relationships, giving rise to **compositional hallucination**, where the model correctly identifies the individual elements but combines their relationships incorrectly (e.g., attribute mismatch, action-subject mismatch, temporal or spatial relationship mismatch).

### Hierarchical Classification of Hallucinations
- **Basic Hallucination**: Incorrect recognition of a single element (common in image MLLMs)
- **Compositional Hallucination**: Incorrect understanding of element relationships (highly concealed, more dangerous)
- **Inferential Hallucination**: Incorrect reasoning based on content

The core of compositional hallucination is a failure to understand relationships, which makes the output appear credible while containing critical errors.
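
The three-level taxonomy above can be read as a small decision rule; a minimal Python sketch, with all names hypothetical (not from the OmniVCHall codebase):

```python
from enum import Enum
from typing import Optional

class HallucinationLevel(Enum):
    BASIC = "basic"                  # a single element is misrecognized
    COMPOSITIONAL = "compositional"  # elements are right, a relation is wrong
    INFERENTIAL = "inferential"      # reasoning over the content is wrong

def classify(elements_ok: bool, relations_ok: bool,
             inference_ok: bool) -> Optional[HallucinationLevel]:
    """Map which part of an answer failed to a hallucination level."""
    if not elements_ok:
        return HallucinationLevel.BASIC
    if not relations_ok:
        return HallucinationLevel.COMPOSITIONAL
    if not inference_ok:
        return HallucinationLevel.INFERENTIAL
    return None  # no hallucination

# A compositional case: every element recognized, but miscombined.
print(classify(True, False, True).value)  # prints "compositional"
```

The point of the hierarchy is that a compositional error only surfaces once element recognition succeeds, which is exactly why it is easy to miss.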

## Design of OmniVCHall: Core Components for Systematic Evaluation of Compositional Hallucination

### Multi-level Hallucination Classification System
Covers four types of compositional relations: attribute-entity, action-subject, temporal, and spatial, making evaluation results interpretable (the hallucination type can be clearly identified).

### Adversarial Sample Construction
1. Positive samples: Accurate descriptions of real videos
2. Negative samples: Keep elements unchanged but perturb relationships (swap subjects, reverse temporal order, etc.)
3. Hard samples: Options constructed to be close to the ground truth but subtly incorrect
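
The perturbation recipe for negative samples can be sketched as follows; the triple format and function names are hypothetical stand-ins for whatever annotation schema the benchmark actually uses:

```python
import random

def perturb_subject(triple, pool, seed=0):
    """Build a negative sample from a (subject, action, object) triple by
    swapping in another subject that also appears in the video, so every
    element stays grounded but the binding becomes wrong."""
    rng = random.Random(seed)
    subject, action, obj = triple
    candidates = [s for s, _, _ in pool if s != subject]
    return (rng.choice(candidates), action, obj)

def reverse_temporal(events):
    """Temporal perturbation: present real events in the wrong order."""
    return list(reversed(events))

pool = [("dog", "chases", "ball"), ("man", "throws", "ball")]
neg = perturb_subject(("dog", "chases", "ball"), pool)
print(neg)  # ('man', 'chases', 'ball'): elements intact, relation wrong
```

Because every element in the negative sample genuinely appears in the video, a model that only checks element presence cannot tell it apart from the positive description.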

### Multi-task Evaluation Protocol
Supports discriminative tasks (judging the correctness of descriptions), selection tasks (choosing correct descriptions), and generation tasks (evaluating hallucinations in generated content), adapting to the characteristics of different models.
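
The discriminative and selection protocols both reduce to accuracy over per-item judgments; a minimal sketch (record formats assumed, not taken from the released code):

```python
def discriminative_accuracy(preds, labels):
    """Discriminative task: the model labels each description True/False."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def selection_accuracy(chosen, gold):
    """Selection task: the model picks one option per question."""
    return sum(c == g for c, g in zip(chosen, gold)) / len(gold)

# Hypothetical mini-run over three items per task.
acc_disc = discriminative_accuracy([True, False, True], [True, True, True])
acc_sel = selection_accuracy(["B", "A", "C"], ["B", "A", "D"])
print(round(acc_disc, 3), round(acc_sel, 3))  # prints 0.667 0.667
```

The generation task is the hardest of the three to score automatically, since it requires detecting hallucinations in free-form output rather than comparing discrete answers.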

## Key Findings: Severe Current State of Compositional Hallucination in Video MLLMs

- **Compositional hallucination is widespread**: Even the best models fall far below human accuracy
- **Scale is not a panacea**: Simply increasing model size yields limited gains on compositional hallucination
- **Temporal relationships are the weak point**: Cross-frame reasoning is weak, with over-reliance on single-frame information
- **Attribute-entity binding fares slightly better**: But detail errors in color and count persist

These findings indicate that video MLLMs should prioritize relationship modeling over simply pursuing scale.

## Anti-Hallucination Decoding: Practical Strategies from Evaluation to Model Improvement

OmniVCHall proposes an **anti-hallucination decoding** method that can reduce hallucinations without retraining:
- **Compositional consistency check**: Verify the consistency of relationships between tokens and video/generated content during decoding
- **Visual anchoring mechanism**: Force generated content to anchor to video visual evidence
- **Backtracking correction strategy**: Backtrack and adjust the generation path when hallucinations are detected

Experiments show that this strategy significantly reduces compositional hallucinations while maintaining fluency.
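
The three mechanisms can be combined into a backtracking search over decoding candidates; the sketch below is a deliberate simplification, with `consistent` standing in for the paper's visual-anchoring and consistency checks:

```python
def decode_with_backtracking(step_candidates, consistent, prefix=()):
    """Depth-first decoding: at each step, try candidates in model-score
    order and keep only tokens the verifier accepts against the video
    evidence; if no continuation works, backtrack and revise an earlier
    choice. `consistent(prefix, token)` is a hypothetical verifier."""
    if not step_candidates:
        return list(prefix)
    head, rest = step_candidates[0], step_candidates[1:]
    for token in head:
        if consistent(prefix, token):
            result = decode_with_backtracking(rest, consistent, prefix + (token,))
            if result is not None:
                return result
    return None  # signals the caller to backtrack

# Toy run: the verifier rejects "cat", which the video does not contain.
evidence = {"dog", "chases", "ball"}
out = decode_with_backtracking(
    [["cat", "dog"], ["chases"], ["ball"]],
    lambda prefix, token: token in evidence,
)
print(" ".join(out))  # prints "dog chases ball"
```

Because the check runs at decoding time, this kind of strategy needs no retraining: it reranks and, when necessary, rewinds the model's own candidates.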

## OmniVCHall Open-Source Ecosystem and Community Plan

Open-sourced content:
- Evaluation dataset (multiple video types + hallucination categories)
- Standardized evaluation code and metric calculation
- Interfaces for mainstream video MLLMs (e.g., Video-LLaMA, VideoChat)
- Hallucination analysis visualization tool

Roadmap: Continuously maintain the benchmark, incorporate new models and methods, establish a public leaderboard, and foster community-driven research.
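
A per-category breakdown like the one the visualization tool promises can be computed from simple (category, correct) records; a hedged sketch, since the released record format is not specified here:

```python
from collections import defaultdict

def per_category_accuracy(records):
    """Aggregate results by compositional-relation category so the report
    says *which* relation type a model hallucinates on, not just how often."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, correct in records:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}

report = per_category_accuracy([
    ("temporal", False), ("temporal", False),
    ("attribute-entity", True), ("attribute-entity", False),
])
print(report)  # temporal: 0.0, attribute-entity: 0.5
```

This is the breakdown that makes the benchmark's results interpretable: a leaderboard entry can show that a model is strong on attribute binding yet fails on temporal order.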

## Technical Insights and Future Directions for Reliability Research of Video MLLMs

### Key Insights
1. Evaluation-driven progress: OmniVCHall fills the gap in video hallucination evaluation
2. Relationship understanding is core: Element relationships must be modeled explicitly
3. Value of decoding strategies: Post-processing optimization is low-cost and delivers quick gains
4. Video specificity: Temporal relationships deserve emphasis, not just spatial features
### Future Outlook
As video AI applications proliferate, reliability will become a core competitive advantage. OmniVCHall lays the groundwork for this direction, and we look forward to more researchers advancing video understanding toward reliability and practicality.
