VEC-DPO: Visual Evidence Calibration Technology Mitigates Hallucination in Multimodal Large Models

VEC-DPO is a hallucination mitigation method for multimodal large language models (MLLMs). It reduces hallucinations in image understanding tasks by calibrating answers against explicit visual evidence.

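Tags: Multimodal large models, hallucination mitigation, visual evidence calibration, DPO, MLLM, visual question answering, AI interpretability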
Published: 2026-06-02 20:10 | Recent activity: 2026-06-02 20:26 | Estimated read: 9 min

Section 01

VEC-DPO: Visual Evidence Calibration Technology Mitigates Hallucination in Multimodal Large Models

Core Insights: VEC-DPO (Visual Evidence Calibration Direct Preference Optimization) is a hallucination mitigation method for multimodal large language models (MLLMs). It guides the model to rely on the actual content of images through explicit visual evidence calibration, thereby reducing hallucinations.

Original Author/Maintainer: wwoww1
Source Platform: GitHub
Original Link: https://github.com/wwoww1/VEC-DPO
Publication Date: 2026-06-02
Related Paper: "Visual Evidence Calibration for Hallucination Mitigation in Multimodal Large Language Models"

This thread covers the background, method, experimental results, application value, limitations, and future directions in separate posts.


Section 02

Background: The Hallucination Dilemma of Multimodal Large Models

Multimodal large language models (e.g., GPT-4V, Gemini, LLaVA) suffer from severe hallucination: the generated content does not match the actual image.

Types of Hallucinations:

  • Object Hallucination: Claiming non-existent objects
  • Attribute Hallucination: Incorrectly describing color/shape/position, etc.
  • Relationship Hallucination: Misunderstanding spatial or interactive relationships between objects
  • Count Hallucination: Incorrectly reporting quantities

Causes:

  1. Overly strong language priors: Relying on language patterns rather than visual information
  2. Insufficient visual-language alignment: Distorted information transfer
  3. Training data noise: Learning from incorrectly labeled data
  4. Limitations of attention mechanisms: Ignoring important visual cues

Section 03

Core Innovations of the VEC-DPO Method

VEC-DPO reduces hallucinations through explicit visual evidence calibration, with core innovations including:

  1. Visual Evidence Extraction Mechanism: When generating answers, the model must label the image regions it relies on (bounding boxes/segmentation masks/heatmaps/text descriptions), improving interpretability and providing supervision signals.
  2. Improved DPO Framework:
    • Preference Data: Includes images, questions, preferred answers (correct + evidence) and non-preferred answers (hallucinatory + inconsistent evidence)
    • Evidence Consistency Constraint: The optimization objective ensures that the answer matches the cited region
  3. Composite Loss Function: Preference loss (encourages correct answers) + evidence alignment loss (measures the matching degree between evidence and images) + consistency regularization (semantic consistency between text and evidence)
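The three loss terms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository's implementation: the preference term is the standard DPO loss, while the evidence alignment term (1 minus bounding-box IoU), the consistency term (1 minus cosine similarity between a text embedding and an evidence-region embedding), the weights lambda_evidence and lambda_consistency, and all function and class names are hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


@dataclass
class PreferencePair:
    """Sequence log-probabilities for one (image, question) preference pair."""
    policy_chosen_logp: float    # log pi_theta(preferred answer | image, question)
    policy_rejected_logp: float  # log pi_theta(hallucinated answer | image, question)
    ref_chosen_logp: float       # same quantities under the frozen reference model
    ref_rejected_logp: float


def dpo_loss(pair: PreferencePair, beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * margin))."""
    margin = (pair.policy_chosen_logp - pair.ref_chosen_logp) - (
        pair.policy_rejected_logp - pair.ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def composite_loss(
    pair: PreferencePair,
    predicted_box: Box,
    reference_box: Box,
    text_embedding: Sequence[float],
    evidence_embedding: Sequence[float],
    beta: float = 0.1,
    lambda_evidence: float = 1.0,     # hypothetical weight
    lambda_consistency: float = 0.5,  # hypothetical weight
) -> float:
    """Preference loss + evidence alignment loss + consistency regularization."""
    preference = dpo_loss(pair, beta)
    # Assumed alignment term: penalize cited regions that miss the reference region.
    evidence_alignment = 1.0 - box_iou(predicted_box, reference_box)
    # Assumed consistency term: the answer text should agree with the cited evidence.
    consistency = 1.0 - cosine_similarity(text_embedding, evidence_embedding)
    return (
        preference
        + lambda_evidence * evidence_alignment
        + lambda_consistency * consistency
    )
```

With this sketch, an answer that cites exactly the reference region and whose text embedding matches the evidence embedding pays only the preference term; any mismatch in the cited box or in the text adds a penalty on top of it.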

Section 04

Experimental Results and Performance Analysis

Benchmark Tests: POPE (object hallucination), MME (comprehensive ability), LLaVA-Bench (open-domain QA)

Key Findings:

  • Significant reduction in hallucinations: Object hallucination rate decreased by 30-50% on the POPE benchmark
  • Preservation of general capabilities: Performance on standard VQA tasks remains stable or slightly improved
  • Improved evidence quality: Generated visual evidence is more accurate and relevant
  • Cross-model transferability: Applicable to architectures like LLaVA-1.5 and InstructBLIP

Ablation Experiments: The full VEC-DPO achieves the best results, with evidence supervision and preference optimization complementing each other.
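For readers unfamiliar with POPE, the benchmark polls the model with yes/no questions about whether an object is present and scores the answers as a binary classification task. The sketch below shows how such scores are typically computed. The function name and the "false_positive_rate" key are assumptions made for this example: treating the share of absent-object questions answered "yes" as an object hallucination rate is one plausible reading, and the thread does not say how the reported 30-50% reduction was measured.

```python
from typing import Dict, Sequence


def pope_style_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Score yes/no answers against ground-truth labels ("yes" = object present)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    tp = fp = tn = fn = 0
    for pred, gold in zip(predictions, labels):
        said_yes = pred.strip().lower() == "yes"
        is_present = gold.strip().lower() == "yes"
        if said_yes and is_present:
            tp += 1
        elif said_yes and not is_present:
            fp += 1  # model affirms an object that is not in the image
        elif not said_yes and is_present:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total,
        # Assumed proxy for object hallucination: "yes" on absent objects.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

Running the same question set through a baseline model and a VEC-DPO-tuned model and comparing the two result dictionaries is the kind of before/after comparison the findings above describe.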

Section 05

Practical Applications and Method Comparison

Application Value:

  • Medical Imaging: Accurately identify lesions and provide interpretable reports
  • Autonomous Driving: Reduce misjudgment of obstacles and enhance robustness
  • Content Moderation: Accurately identify violating content and meet interpretability requirements
  • Assistive Technology: Provide reliable scene descriptions for visually impaired individuals

Comparison with Other Methods:

  • Versus data cleaning: does not depend on curating cleaner training data
  • Versus post-processing: adds little inference overhead
  • Versus contrastive learning: aligns evidence at a finer granularity
  • Versus standard DPO: adds visual evidence calibration, so optimization covers evidence grounding as well as answer preference

Section 06

Limitations and Future Work

Limitations:

  1. High cost of evidence annotation: Manual annotation is expensive
  2. Weak handling of complex scenes: Imprecise evidence localization in crowded/occluded/low-quality images
  3. Limited fine-grained hallucination detection: Effectiveness for attribute/relationship-level hallucinations needs improvement
  4. Real-time challenges: Generating evidence increases computational overhead

Future Directions:

  • Explore self-supervised/weakly supervised evidence generation
  • Enhance the robustness of visual encoders and introduce multi-scale evidence
  • Design fine-grained evidence representations and integrate common sense reasoning
  • Optimize model lightweighting and hardware acceleration

Section 07

Open-Source Contributions and Conclusion

Open-Source Value:

  • Reproducibility: Open code facilitates result verification
  • Benchmark Tools: Provides hallucination evaluation tools
  • Extension Foundation: Supports developers in exploring new variants
  • Educational Value: Serves as a teaching case for multimodal alignment and hallucination mitigation

Conclusion: VEC-DPO pioneers the training paradigm of "teaching models to present evidence", improving accuracy and interpretability. In the future, models integrating explicit evidence mechanisms will be more transparent and trustworthy, promoting the application of multimodal AI in key fields.