
OmniVCHall: A Comprehensive Benchmark for Diagnosing Compositional Hallucinations in Video Multimodal Large Models

A breakthrough study accepted by ICML 2026 proposes the first systematic benchmark dataset for diagnosing compositional hallucinations in video multimodal large models, along with the TriCD decoding framework, which can significantly improve model robustness without fine-tuning.

Tags: Video Multimodal LLMs, Hallucination Detection, Compositional Reasoning, ICML 2026, Contrastive Decoding, VLLM, Benchmarking, Machine Learning
Published 2026-05-14 09:52 · Recent activity 2026-05-14 10:01 · Estimated read: 6 min

Section 01

OmniVCHall: A Comprehensive Benchmark for Diagnosing Compositional Hallucinations in Video Multimodal LLMs

This ICML 2026 accepted study presents the first systematic benchmark, OmniVCHall, for diagnosing compositional hallucinations in video multimodal large language models (VLLMs), along with the TriCD decoding framework, which significantly improves model robustness without fine-tuning. Key focus areas include evaluating how well VLLMs reason over combined visual evidence and addressing hallucination in complex video scenarios.


Section 02

Research Background: The Problem of Compositional Hallucination in VLLMs

Video multimodal LLMs (VLLMs) have made progress in understanding complex video content but suffer from hallucinations (answers not supported by the video content). Existing benchmarks focus on single-type errors (e.g., wrong actions, temporal confusion), but real-world scenarios require joint reasoning over multiple pieces of visual evidence (objects, actions, time, camera motion, etc.). Failures in such joint reasoning are termed 'compositional hallucination', a major challenge for current VLLMs.


Section 03

OmniVCHall Benchmark: Dataset & Design

OmniVCHall is the first benchmark for compositional hallucination. It includes:

  • Dataset: 823 videos (real + AI-generated) with 9,027 QA pairs (public on Hugging Face).
  • 8 Hallucination Types: Object, Scene, Event, Action, Relation, Attribute, Temporal, Camera (newly introduced).
  • Dual Test Structure: Single-type (one evidence type) and Compositional (multiple evidence types) queries, in both Yes/No and Multiple-choice QA formats.
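To make the dual test structure concrete, here is a hypothetical sketch of how a single-type item and a compositional item might be represented. The field names and example questions are illustrative assumptions, not the released OmniVCHall schema.

```python
# Hypothetical item layouts; the actual OmniVCHall schema may differ.
single_type_item = {
    "video_id": "vid_0001",
    "hallucination_types": ["Camera"],  # one evidence type to verify
    "format": "yes_no",
    "question": "Does the camera zoom in during the clip?",
    "answer": "No",
}

compositional_item = {
    "video_id": "vid_0002",
    # Multiple evidence types must be verified jointly.
    "hallucination_types": ["Action", "Temporal", "Camera"],
    "format": "multiple_choice",
    "question": "What happens immediately after the camera pans left?",
    "options": ["A person sits down", "A dog runs past",
                "Nothing changes", "None of the above"],
    "answer": "A person sits down",
}

# A compositional query is one that ties several evidence types together;
# answering it correctly requires grounding each of them in the video.
assert len(compositional_item["hallucination_types"]) > 1
print(single_type_item["format"], compositional_item["format"])
```

The key distinction the benchmark probes is exactly this: a model can pass every single-type check and still fail when the evidence types must be combined in one answer.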

Section 04

Key Findings from Benchmark Evaluation

Evaluation of 39 mainstream VLLMs shows:

  • Performance drops significantly when shifting from single-type to compositional queries, even for top models.
  • Camera motion reasoning is particularly hard: models often confuse lens movement (zoom/pan) with object motion, revealing flaws in visual grounding mechanisms.

Section 05

TriCD: Plug-and-Play Decoding Framework for Anti-Hallucination

TriCD (Triple-path Contrastive Decoding) is a no-fine-tuning framework to boost VLLM robustness:

  • Three Paths:
  1. Original: Standard model logits.
  2. Negative: Adaptive perturbation (APC) to expose hallucination paths.
  3. Positive: Saliency-guided enhancement (SGE) using DINOv3's spatial/temporal cues to reinforce evidence-supported predictions.
  • Calibration Formula: q_t = q_t^o + α₁(q_t^p - q_t^o) + α₂(q_t^o - q_t^n) (encourages evidence-supported answers, suppresses hallucinations).
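The calibration formula above can be sketched directly on next-token logits. In this minimal sketch, q_o, q_p, and q_n are assumed to be logit vectors from the original, positive (SGE), and negative (APC) paths; the α values and toy numbers are illustrative, not the paper's settings.

```python
import numpy as np

def tricd_calibrate(q_o, q_p, q_n, alpha1=0.5, alpha2=0.5):
    """Triple-path contrastive calibration (sketch).

    q_t = q^o + alpha1 * (q^p - q^o) + alpha2 * (q^o - q^n)
    pulls the distribution toward tokens boosted by the evidence-enhanced
    positive path and away from tokens favored by the perturbed negative
    (hallucination-exposing) path.
    """
    q_o, q_p, q_n = map(np.asarray, (q_o, q_p, q_n))
    return q_o + alpha1 * (q_p - q_o) + alpha2 * (q_o - q_n)

# Toy example: token 0 is favored by the negative (hallucination) path,
# token 2 is reinforced by the positive (evidence-grounded) path.
q_o = np.array([2.0, 1.0, 1.5])
q_p = np.array([1.5, 1.0, 3.0])
q_n = np.array([3.0, 1.0, 0.5])
q_t = tricd_calibrate(q_o, q_p, q_n)
print(q_t)  # [1.25 1.   2.75]
```

Note how the greedy choice moves from token 0 (the hallucination-prone candidate under the original logits) to token 2 (the evidence-supported one) after calibration.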

Section 06

Experimental Results of TriCD

TriCD shows strong results:

  • Improves the average accuracy of representative VLLMs by over 10 percentage points, on both Yes/No and Multiple-choice formats.
  • Corrects camera motion confusion (lens vs object movement).
  • Handles tricky questions (e.g., adversarial options like 'all correct'/'none').

Section 07

Technical Implementation & Usage

Project code is available with setup steps:

  1. Create environment: conda env create -f environment.yml then conda activate videoproject.
  2. Smoke test: bash vcd/train/run_smoke_fast5_llavanv.sh.
  3. Full training: bash vcd/train/run_fast5_subset1800_llavanv_1epoch.sh.

Section 08

Conclusion & Future Outlook

OmniVCHall and TriCD open new directions for VLLM hallucination research:

  • Provides a standardized benchmark for compositional hallucination.
  • Offers a cost-effective way (no fine-tuning) to improve model reliability.
  • Valuable for video understanding, multimodal learning, and AI safety.
  • Future work: tackling compositional hallucination to build trustworthy visual AI systems as video content becomes ever more central to AI applications.