
OmniVCHall: A Comprehensive Benchmark for Diagnosing Compositional Hallucinations in Video Multimodal Large Models

A breakthrough study accepted by ICML 2026 proposes the first systematic benchmark dataset for diagnosing compositional hallucinations in video multimodal large models, along with the TriCD decoding framework, which can significantly improve model robustness without fine-tuning.

Tags: Video Multimodal LLMs, Hallucination Detection, Compositional Reasoning, ICML 2026, Contrastive Decoding, VLLM, Benchmarking, Machine Learning
Published 2026-05-14 09:52 · Recent activity 2026-05-14 10:01 · Estimated read: 6 min

Section 01

OmniVCHall: A Comprehensive Benchmark for Diagnosing Compositional Hallucinations in Video Multimodal LLMs

This ICML 2026 accepted study presents the first systematic benchmark, OmniVCHall, for diagnosing compositional hallucinations in video multimodal large language models (VLLMs), along with the TriCD decoding framework, which significantly improves model robustness without fine-tuning. Key focus areas include evaluating how well VLLMs reason over combined visual evidence and addressing hallucination in complex video scenarios.


Section 02

Research Background: The Problem of Compositional Hallucination in VLLMs

Video multimodal LLMs (VLLMs) have made progress in understanding complex video content but suffer from hallucinations (answers not supported by the video content). Existing benchmarks focus on single-type errors (e.g., wrong actions, temporal confusion), but real-world scenarios require joint reasoning over multiple pieces of visual evidence (objects, actions, time, camera motion, etc.). Failures in such joint reasoning are termed 'compositional hallucination', a major challenge for current VLLMs.


Section 03

OmniVCHall Benchmark: Dataset & Design

OmniVCHall is the first benchmark for compositional hallucination. It includes:

  • Dataset: 823 videos (real + AI-generated) with 9,027 QA pairs (public on Hugging Face).
  • 8 Hallucination Types: Object, Scene, Event, Action, Relation, Attribute, Temporal, Camera (newly introduced).
  • Dual Test Structure: Single-type (one evidence type) and Compositional (multiple evidence types) queries, in both Yes/No and Multiple-choice QA formats.
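To make the dual test structure concrete, here is a hypothetical sketch of how a single-type item and a compositional item might be represented. The field names and example questions are illustrative assumptions, not the released OmniVCHall schema.

```python
# Hypothetical item layouts; the actual OmniVCHall schema may differ.
single_type_item = {
    "video_id": "vid_0001",
    "hallucination_types": ["Camera"],  # one evidence type to verify
    "format": "yes_no",
    "question": "Does the camera zoom in during the clip?",
    "answer": "No",
}

compositional_item = {
    "video_id": "vid_0002",
    # Multiple evidence types must be verified jointly.
    "hallucination_types": ["Action", "Temporal", "Camera"],
    "format": "multiple_choice",
    "question": "What happens immediately after the camera pans left?",
    "options": ["A person sits down", "A dog runs past",
                "Nothing changes", "None of the above"],
    "answer": "A person sits down",
}

# A compositional query is one that ties several evidence types together;
# answering it correctly requires grounding each of them in the video.
assert len(compositional_item["hallucination_types"]) > 1
print(single_type_item["format"], compositional_item["format"])
```

The key distinction the benchmark probes is exactly this: a model can pass every single-type check and still fail when the evidence types must be combined in one answer.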

Section 04

Key Findings from Benchmark Evaluation

Evaluation of 39 mainstream VLLMs shows:

  • Performance drops significantly when shifting from single-type to compositional queries, even for top models.
  • Camera motion reasoning is particularly hard: models often confuse lens movement (zoom/pan) with object motion, revealing flaws in visual grounding mechanisms.

Section 05

TriCD: Plug-and-Play Decoding Framework for Anti-Hallucination

TriCD (Triple-path Contrastive Decoding) is a no-fine-tuning framework to boost VLLM robustness:

  • Three Paths:
  1. Original: Standard model logits.
  2. Negative: Adaptive perturbation (APC) to expose hallucination paths.
  3. Positive: Saliency-guided enhancement (SGE) using DINOv3's spatial/temporal cues to reinforce evidence-supported predictions.
  • Calibration Formula: q_t = q_t^o + α₁(q_t^p - q_t^o) + α₂(q_t^o - q_t^n) (encourages evidence-supported answers, suppresses hallucinations).
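The calibration formula above can be sketched directly on next-token logits. In this minimal sketch, q_o, q_p, and q_n are assumed to be logit vectors from the original, positive (SGE), and negative (APC) paths; the α values and toy numbers are illustrative, not the paper's settings.

```python
import numpy as np

def tricd_calibrate(q_o, q_p, q_n, alpha1=0.5, alpha2=0.5):
    """Triple-path contrastive calibration (sketch).

    q_t = q^o + alpha1 * (q^p - q^o) + alpha2 * (q^o - q^n)
    pulls the distribution toward tokens boosted by the evidence-enhanced
    positive path and away from tokens favored by the perturbed negative
    (hallucination-exposing) path.
    """
    q_o, q_p, q_n = map(np.asarray, (q_o, q_p, q_n))
    return q_o + alpha1 * (q_p - q_o) + alpha2 * (q_o - q_n)

# Toy example: token 0 is favored by the negative (hallucination) path,
# token 2 is reinforced by the positive (evidence-grounded) path.
q_o = np.array([2.0, 1.0, 1.5])
q_p = np.array([1.5, 1.0, 3.0])
q_n = np.array([3.0, 1.0, 0.5])
q_t = tricd_calibrate(q_o, q_p, q_n)
print(q_t)  # [1.25 1.   2.75]
```

Note how the greedy choice moves from token 0 (the hallucination-prone candidate under the original logits) to token 2 (the evidence-supported one) after calibration.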

Section 06

Experimental Results of TriCD

TriCD shows strong results:

  • Improves the average accuracy of representative VLLMs by over 10 percentage points, on both Yes/No and Multiple-choice formats.
  • Corrects camera motion confusion (lens vs object movement).
  • Handles tricky questions (e.g., adversarial options like 'all correct'/'none').

Section 07

Technical Implementation & Usage

Project code is available with setup steps:

  1. Create environment: conda env create -f environment.yml then conda activate videoproject.
  2. Smoke test: bash vcd/train/run_smoke_fast5_llavanv.sh.
  3. Full training: bash vcd/train/run_fast5_subset1800_llavanv_1epoch.sh.

Section 08

Conclusion & Future Outlook

OmniVCHall and TriCD open new directions for VLLM hallucination research:

  • Provides a standardized benchmark for compositional hallucination.
  • Offers a cost-effective way (no fine-tuning) to improve model reliability.
  • Valuable for video understanding, multimodal learning, and AI safety.
  • Future work: tackling compositional hallucination to build trustworthy visual AI systems as video content becomes ever more central to AI applications.