OmniVCHall: A Comprehensive Benchmark for Diagnosing Compositional Hallucinations in Video Multimodal LLMs

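A study accepted at ICML 2026 that proposes the first benchmark dataset for systematically diagnosing compositional hallucinations in video multimodal large language models, together with the TriCD decoding framework, which significantly improves model robustness without fine-tuning.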

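Tags: video multimodal LLMs, hallucination detection, compositional reasoning, ICML 2026, contrastive decoding, VLLM, benchmarking, machine learning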

Published: 2026/05/14 09:52 | Last activity: 2026/05/14 10:01 | Estimated reading time: 6 minutes

Section 01

OmniVCHall: A Comprehensive Benchmark for Diagnosing Compositional Hallucinations in Video Multimodal LLMs

This study, accepted at ICML 2026, presents OmniVCHall, the first systematic benchmark for diagnosing compositional hallucinations in video multimodal large language models (VLLMs). It also introduces TriCD, a decoding framework that significantly improves model robustness without fine-tuning. The work focuses on how well VLLMs reason jointly over several kinds of visual evidence and on reducing hallucinations in complex video scenarios.

Section 02

Research Background: The Problem of Compositional Hallucination in VLLMs

Video multimodal LLMs (VLLMs) have made progress in understanding complex video content but still hallucinate, producing answers that the video content does not support. Existing benchmarks focus on single-type errors (e.g., wrong actions, temporal confusion), whereas real-world scenarios require joint reasoning over multiple kinds of visual evidence (objects, actions, timing, camera motion, etc.). Errors that arise in this joint reasoning are called 'compositional hallucinations', and they are a major challenge for current VLLMs.

Section 03

OmniVCHall Benchmark: Dataset & Design

OmniVCHall is the first benchmark for compositional hallucination. It includes:

Dataset: 823 videos (real and AI-generated) with 9,027 QA pairs, publicly available on Hugging Face.
8 Hallucination Types: Object, Scene, Event, Action, Relation, Attribute, Temporal, Camera (newly introduced).
Dual Test Structure: Single-type queries (one kind of evidence) and Compositional queries (several kinds of evidence), each in Yes/No and Multiple-choice QA formats.
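
To make the dual test structure concrete, here is a minimal sketch of how one QA record could be represented. The field names (video_id, qa_format, types) and the example questions are hypothetical and chosen only for illustration; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    # The eight types listed above.
    OBJECT = "object"
    SCENE = "scene"
    EVENT = "event"
    ACTION = "action"
    RELATION = "relation"
    ATTRIBUTE = "attribute"
    TEMPORAL = "temporal"
    CAMERA = "camera"


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record layout, not the dataset's real schema."""
    video_id: str
    question: str
    answer: str
    qa_format: str    # assumed values: "yes_no" or "multiple_choice"
    types: frozenset  # hallucination types the question probes

    @property
    def is_compositional(self) -> bool:
        # A query is compositional when it needs more than one kind of evidence.
        return len(self.types) > 1


# Invented example questions, for illustration only.
single = QAItem("v001", "Is there a dog in the video?", "yes", "yes_no",
                frozenset({HallucinationType.OBJECT}))
combo = QAItem("v001", "Does the camera zoom in while the dog jumps?", "no",
               "yes_no",
               frozenset({HallucinationType.CAMERA, HallucinationType.ACTION}))
```

Under this layout, a single-type query carries exactly one type and a compositional query carries two or more, so the two test splits can be derived from the types field alone.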

Section 04

Key Findings from Benchmark Evaluation

Evaluation of 39 mainstream VLLMs shows:

Accuracy drops significantly when queries shift from single-type to compositional, even for the top-performing models.
Camera motion reasoning is particularly hard: models often confuse camera movement (zoom/pan) with object motion, which points to weaknesses in their visual grounding.
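
The single-type versus compositional comparison can be sketched as a per-split accuracy computation. This is an illustrative helper, not the benchmark's evaluation script; the function name, the split labels, and the toy predictions are all assumptions, and the numbers are made up rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_split(results):
    """Compute accuracy per split from (split, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for split, is_correct in results:
        total[split] += 1
        correct[split] += int(is_correct)
    return {split: correct[split] / total[split] for split in total}


# Made-up toy predictions, NOT results from the paper.
toy = [
    ("single", True), ("single", True), ("single", True), ("single", False),
    ("compositional", True), ("compositional", False),
    ("compositional", False), ("compositional", False),
]
acc = accuracy_by_split(toy)

# The gap between the two splits is the quantity the finding above refers to.
gap = acc["single"] - acc["compositional"]
```

Reporting the gap alongside the raw accuracies makes it easier to see whether a model that looks strong on single-type queries stays strong once several kinds of evidence must be combined.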

Section 05

TriCD: Plug-and-Play Decoding Framework for Anti-Hallucination

TriCD (Triple-path Contrastive Decoding) is a decoding framework that improves VLLM robustness without fine-tuning:

Three Paths:

Original: Standard model logits.
Negative: Adaptive perturbation (APC) to expose hallucination paths.
Positive: Saliency-guided enhancement (SGE) using DINOv3's spatial/temporal cues to reinforce evidence-supported predictions.

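Calibration Formula: q_t = q_t^o + α₁(q_t^p - q_t^o) + α₂(q_t^o - q_t^n), where q_t^o, q_t^p, and q_t^n are the logits of the original, positive, and negative paths at decoding step t, and α₁ and α₂ weight the two contrastive terms. The first term pulls the prediction toward the evidence-enhanced path; the second pushes it away from the perturbed path, suppressing hallucinations.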
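
The calibration formula can be sketched in a few lines of plain Python. This only illustrates the arithmetic of the formula above; it is not the authors' implementation. The function name, the α values, and the toy logits are assumptions, and how the positive and negative logits are actually produced (SGE and APC) is not shown.

```python
def tricd_calibrate(q_o, q_p, q_n, alpha1, alpha2):
    """Apply q = q_o + alpha1 * (q_p - q_o) + alpha2 * (q_o - q_n) per token.

    q_o, q_p, q_n: logits from the original, positive, and negative paths
    for one decoding step, given as equal-length lists of floats.
    """
    return [o + alpha1 * (p - o) + alpha2 * (o - n)
            for o, p, n in zip(q_o, q_p, q_n)]


def argmax(values):
    # Index of the largest entry.
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits over a 3-token vocabulary, for illustration only.
q_o = [2.0, 2.2, 0.1]  # original path slightly prefers token 1
q_p = [2.6, 2.0, 0.1]  # positive path: enhanced evidence favors token 0
q_n = [1.0, 2.8, 0.1]  # negative path: perturbation inflates token 1

# alpha1 = alpha2 = 0.5 is an arbitrary choice, not a value from the paper.
q = tricd_calibrate(q_o, q_p, q_n, alpha1=0.5, alpha2=0.5)

chosen_before = argmax(q_o)
chosen_after = argmax(q)
```

In this toy case the original path would pick token 1, while the calibrated logits favor token 0, because token 1 loses support under the positive path and gains it under the negative path. Setting both weights to zero recovers the original logits unchanged.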

Section 06

Experimental Results of TriCD

Reported results for TriCD:

Improves the average accuracy of representative VLLMs by over 10 percentage points on both Yes/No and Multiple-choice questions.
Corrects cases where models confuse camera motion with object motion.
Handles tricky questions, such as those with adversarial options like 'all correct' or 'none'.

Section 07

Technical Implementation & Usage

The project code is available. Setup steps:

Create the environment: run conda env create -f environment.yml, then conda activate videoproject.
Smoke test: bash vcd/train/run_smoke_fast5_llavanv.sh.
Full training: bash vcd/train/run_fast5_subset1800_llavanv_1epoch.sh.

Section 08

Conclusion & Future Outlook

OmniVCHall and TriCD open new directions for VLLM hallucination research:

OmniVCHall provides a standardized benchmark for compositional hallucination.
TriCD offers a cost-effective way to improve model reliability, since it requires no fine-tuning.
Both are relevant to video understanding, multimodal learning, and AI safety.
Future work: addressing compositional hallucination will matter more for building trustworthy visual AI systems as video content becomes more prevalent in AI applications.