Zing Forum

OmniVCHall: The First Open-Source Hallucination Evaluation Benchmark for Video Multimodal Large Models

OmniVCHall, accepted at ICML 2026, has been officially open-sourced. It is the first evaluation benchmark specifically targeting the compositional hallucination problem in video multimodal large models, providing an important tool for assessing the reliability of video understanding models.

Tags: OmniVCHall, Video Multimodal Models, Hallucination Detection, Compositional Hallucination, Video Understanding, ICML 2026, Evaluation Benchmark, Anti-Hallucination Decoding
Published 2026-05-14 10:10 · Recent activity 2026-05-14 10:21 · Estimated read: 8 min

Section 01

[Introduction] OmniVCHall: The First Open-Source Compositional Hallucination Evaluation Benchmark for Video Multimodal Large Models

OmniVCHall, accepted at ICML 2026, has been officially open-sourced. It is the first systematic evaluation benchmark specifically targeting the compositional hallucination problem in video multimodal large language models (MLLMs). The benchmark fills a critical gap in the reliability assessment of video understanding models and provides an important tool for deploying video MLLMs in safety-critical scenarios such as autonomous driving and medical diagnosis.


Section 02

Hallucination Challenges in Video Understanding and Definition of Compositional Hallucination

Challenges of Video Hallucination

The hallucination problems faced by video MLLMs are more complex than those of image models: they span spatiotemporal dimensions and cross-frame relationships, giving rise to compositional hallucination, where the model correctly identifies individual elements but combines their relationships incorrectly (e.g., attribute mismatch, action-subject mismatch, or temporal/spatial relationship mismatch).

Hierarchical Classification of Hallucinations

  • Basic Hallucination: Incorrect recognition of a single element (common in image MLLMs)
  • Compositional Hallucination: Incorrect understanding of relationships between elements (highly concealed and therefore more dangerous)
  • Inferential Hallucination: Incorrect reasoning based on the content

The core of compositional hallucination is a failure to understand relationships, making the output seem credible while containing fatal errors.
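The three levels above can be sketched as a small taxonomy. The class and field names here are illustrative, not identifiers from the paper's release:

```python
from enum import Enum

class HallucinationLevel(Enum):
    """Illustrative taxonomy of the three hallucination levels described above."""
    BASIC = "basic"                   # a single element is misrecognized
    COMPOSITIONAL = "compositional"   # elements are right, their relationship is wrong
    INFERENTIAL = "inferential"       # reasoning over the content is wrong

# Example: the caption "a man in red kicks a ball" for a video of a man in blue.
# Every element (man, ball, kick, a color) is present, but the attribute is
# bound to the wrong entity -- a compositional hallucination.
example = HallucinationLevel.COMPOSITIONAL
```

The point of the middle level is exactly the concealment the article describes: nothing in the output is individually false, only the binding between elements.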

Section 03

Design of OmniVCHall: Core Components for Systematic Evaluation of Compositional Hallucination

Multi-level Hallucination Classification System

Covers four types of compositional relationships: attribute-entity, action-subject, temporal, and spatial. This makes evaluation results interpretable, since each failure can be attributed to a specific hallucination type.

Adversarial Sample Construction

  1. Positive samples: Accurate descriptions of real videos
  2. Negative samples: Keep elements unchanged but perturb relationships (swap subjects, reverse temporal order, etc.)
  3. Hard samples: Construct options that are close to real but incorrect
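The perturbation idea behind the negative samples can be sketched on a structured caption. The event schema below (subject/action/object dicts in temporal order) is an assumption for illustration, not the benchmark's actual annotation format:

```python
import copy

# Events as (subject, action, object) dicts in temporal order; this structure
# is illustrative, not the benchmark's annotation schema.
positive = [
    {"subject": "man", "action": "opens", "object": "door"},
    {"subject": "dog", "action": "chases", "object": "ball"},
]

def swap_subjects(events, i, j):
    """Negative sample: keep every element but exchange two subjects."""
    out = copy.deepcopy(events)
    out[i]["subject"], out[j]["subject"] = out[j]["subject"], out[i]["subject"]
    return out

def reverse_temporal(events):
    """Negative sample: identical events, reversed temporal order."""
    return list(reversed(events))

neg_subject = swap_subjects(positive, 0, 1)   # now "the dog opens the door"
neg_temporal = reverse_temporal(positive)     # the chase now precedes the opening
```

Because the element inventory is unchanged, a model that only checks "are these things in the video?" cannot distinguish the negatives from the positive; only relationship understanding can.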

Multi-task Evaluation Protocol

Supports discriminative tasks (judging the correctness of descriptions), selection tasks (choosing correct descriptions), and generation tasks (evaluating hallucinations in generated content), adapting to the characteristics of different models.
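For the two choice-based formats, scoring reduces to accuracy over item-level answers. The function names and interfaces below are assumptions for illustration, not the benchmark's released evaluation API:

```python
def score_discriminative(judgments, labels):
    """Accuracy of true/false judgments on (video, description) pairs."""
    return sum(p == y for p, y in zip(judgments, labels)) / len(labels)

def score_selection(choices, answer_keys):
    """Accuracy of picking the single correct description per question."""
    return sum(c == k for c, k in zip(choices, answer_keys)) / len(answer_keys)

# The generation task is different in kind: it scores hallucinations found in
# free-form output, which requires matching generated claims against the
# video annotations rather than comparing to a fixed answer key.
```

The value of offering all three formats is coverage: discriminative and selection tasks suit models with constrained output, while the generation task probes hallucination in the open-ended setting where it actually occurs in deployment.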


Section 04

Key Findings: Severe Current State of Compositional Hallucination in Video MLLMs

  • Compositional hallucination is widespread: The accuracy of even the best models falls far below human performance
  • Scale is not a panacea: Simply increasing model size yields limited improvement on compositional hallucination
  • Temporal relationships are a weakness: Cross-frame reasoning is weak, with over-reliance on single-frame information
  • Attribute-entity binding is slightly better: Still shows detail errors in attributes such as color and quantity

These findings indicate that video MLLMs need to focus on optimizing relationship modeling rather than just pursuing scale.
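Findings like these fall out of breaking accuracy down by relationship category. A minimal sketch of that aggregation, with made-up item results over the four categories:

```python
from collections import defaultdict

def accuracy_by_category(results):
    """Aggregate item-level (category, correct) pairs into per-category accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for cat, correct in results:
        totals[cat] += 1
        hits[cat] += int(correct)
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Made-up item results; real data comes from running a model on the benchmark.
results = [
    ("temporal", False), ("temporal", False), ("temporal", True),
    ("attribute-entity", True), ("attribute-entity", True),
    ("spatial", True), ("action-subject", False),
]
report = accuracy_by_category(results)  # e.g. temporal accuracy lags the rest
```

A per-category report is what makes conclusions like "temporal relationships are a weakness" possible, as opposed to a single aggregate accuracy that hides where models fail.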

Section 05

Anti-Hallucination Decoding: Practical Strategies from Evaluation to Model Improvement

OmniVCHall proposes an anti-hallucination decoding method that can reduce hallucinations without retraining:

  • Compositional consistency check: Verify the consistency of relationships between tokens and video/generated content during decoding
  • Visual anchoring mechanism: Force generated content to anchor to video visual evidence
  • Backtracking correction strategy: Backtrack and adjust the generation path when a hallucination is detected

Experiments show that this strategy significantly reduces compositional hallucinations while maintaining fluency.
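The interplay of the consistency check and backtracking can be sketched as a decode loop. Everything here is a toy stand-in: `consistency` substitutes for the paper's compositional consistency check, and the threshold and candidate lists are assumptions:

```python
def decode_with_backtracking(candidates_per_step, evidence, consistency, threshold=0.5):
    """Greedy decode; if the top continuation is inconsistent with the visual
    evidence, back off to the next-best candidate (backtracking correction)."""
    output = []
    for candidates in candidates_per_step:
        # candidates: tokens sorted by model probability, best first
        for token in candidates:
            if consistency(output + [token], evidence) >= threshold:
                output.append(token)  # token is anchored to the evidence
                break
        else:
            output.append(candidates[0])  # nothing passes; keep the top token
    return output

# Toy consistency: a token is "anchored" if it appears in the visual evidence.
evidence = {"man", "opens", "door"}
cons = lambda partial, ev: 1.0 if partial[-1] in ev else 0.0
out = decode_with_backtracking(
    [["dog", "man"], ["opens"], ["window", "door"]], evidence, cons
)
# out == ["man", "opens", "door"]: "dog" and "window" are rejected at decode time
```

The appeal of this family of methods is exactly what the article notes: it intervenes at decoding, so no retraining is needed and the base model's fluency is preserved.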

Section 06

OmniVCHall Open-Source Ecosystem and Community Plan

Open-sourced content:

  • Evaluation dataset (multiple video types + hallucination categories)
  • Standardized evaluation code and metric calculation
  • Interfaces for mainstream video MLLMs (e.g., Video-LLaMA, VideoChat)
  • Hallucination analysis and visualization tool

Plans: Continuously maintain the benchmark, incorporate new models and methods, establish a public leaderboard, and promote community-driven research.

Section 07

Technical Insights and Future Directions for Reliability Research of Video MLLMs

Key Insights

  1. Evaluation-driven progress: OmniVCHall fills the gap in video hallucination evaluation
  2. Relationship understanding is core: Explicitly model element relationships
  3. Value of decoding strategies: Post-processing optimization is low-cost and yields quick gains
  4. Video specificity: Need to emphasize temporal relationships rather than just spatial features

Future Outlook

As video AI applications proliferate, reliability will become a core competitive advantage. OmniVCHall lays the foundation for this direction, and we look forward to more researchers pushing video understanding technology toward reliability and practicality.