
VisionWeaver: Addressing Hallucination in Multimodal Large Models from the Visual Encoder Perspective

A study accepted to the Findings of EMNLP 2025 that proposes alleviating object hallucination in large vision-language models by dynamically aggregating features from multiple specialized visual encoders, released together with the VHBench-10 fine-grained evaluation benchmark.

Tags: vision-language models · object hallucination · multi-expert architecture · dynamic routing · VHBench-10 · EMNLP 2025 · CLIP · DINOv2 · SAM · multimodal learning
Published 2026-04-09 15:39 · Recent activity 2026-04-09 15:45 · Estimated read: 8 min
Section 01

VisionWeaver: Addressing Hallucination in Multimodal Large Models from the Visual Encoder Perspective (Opening Post)

VisionWeaver, a study accepted by EMNLP 2025 Findings, proposes to alleviate object hallucination in large vision-language models by dynamically aggregating features from multiple specialized visual encoders, and releases the VHBench-10 fine-grained evaluation benchmark as a companion. The core idea is to optimize from the source of visual feature extraction, using a multi-expert architecture and dynamic routing mechanism to reduce hallucinations.

Section 02

Background: The Hallucination Dilemma of Vision-Language Models

Large vision-language models (LVLMs) have made significant progress in image understanding and generation tasks, but object hallucination (describing objects or attributes that are not present) seriously undermines their reliability. Traditional solutions focus on the language-decoding side (data quality, decoding strategies, post-processing) and fail to address the root cause. The VisionWeaver team hypothesizes that different visual encoders carry different inductive biases and therefore exhibit different hallucination patterns, so mitigation should start at the source: visual feature extraction.

Section 03

Method: Core Innovations of VisionWeaver

Multi-Expert Visual Encoder Architecture

Instead of relying on a single encoder, VisionWeaver integrates multiple specialized experts:

  • CLIP: Main encoder, providing global visual understanding
  • DINOv2: Self-supervised fine-grained feature learning
  • SAM: Segmentation capability, locating object boundaries
  • Vary: Document and text image understanding
  • ConvNext and EVA-02: Complementary visual representations
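Before fusion, the heterogeneous expert outputs have to live in a common feature space. A minimal PyTorch sketch of that alignment step is below; the per-expert feature widths, token count, and shared dimension are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Each expert produces features of its own width; a per-expert linear
# projection maps them into one shared space so they can later be fused.
# All dimensions here are hypothetical.
expert_dims = {"clip": 1024, "dinov2": 768, "sam": 256,
               "vary": 1024, "convnext": 1536, "eva02": 1024}
shared_dim = 1024
projections = nn.ModuleDict(
    {name: nn.Linear(dim, shared_dim) for name, dim in expert_dims.items()}
)

# Stand-in per-expert token features for a batch of 2 images, 16 tokens each.
raw = {name: torch.randn(2, 16, dim) for name, dim in expert_dims.items()}
# Stack into a single (batch, experts, tokens, shared_dim) tensor.
aligned = torch.stack([projections[n](f) for n, f in raw.items()], dim=1)
print(aligned.shape)  # torch.Size([2, 6, 16, 1024])
```

The stacked tensor gives the routing network one uniform view of all six experts.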

Dynamic Routing Mechanism

The router uses CLIP's [CLS] token to generate routing signals and fuses the expert features with the resulting weights, achieving:

  1. Adaptive selection of expert combinations (based on image type)
  2. Global understanding guiding local fusion
  3. Reducing single encoder bias

The core is a context-aware routing network that intelligently aggregates the advantages of multiple experts.
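The routing step described above can be sketched as a small gating MLP over the CLIP [CLS] embedding whose softmax output weights a sum over expert features. This is a minimal sketch under assumed layer sizes, not the paper's actual network.

```python
import torch
import torch.nn as nn

class ContextAwareRouter(nn.Module):
    """Sketch of a context-aware routing network (hypothetical sizes).

    The CLIP [CLS] token summarizes the whole image; a small MLP maps it
    to one weight per expert, and expert features are fused as a
    weighted sum over the expert axis.
    """

    def __init__(self, cls_dim: int, num_experts: int, hidden: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(cls_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_experts),
        )

    def forward(self, cls_token: torch.Tensor, expert_feats: torch.Tensor):
        # cls_token:    (batch, cls_dim)              global CLIP summary
        # expert_feats: (batch, experts, tokens, dim) per-expert features
        weights = self.gate(cls_token).softmax(dim=-1)        # (batch, experts)
        fused = (weights[:, :, None, None] * expert_feats).sum(dim=1)
        return fused, weights

router = ContextAwareRouter(cls_dim=768, num_experts=6)
cls = torch.randn(2, 768)            # stand-in CLIP [CLS] embeddings
feats = torch.randn(2, 6, 16, 1024)  # 6 experts, 16 tokens, width 1024
fused, w = router(cls, feats)
print(fused.shape, w.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 6])
```

Because the gate conditions only on the global [CLS] summary, the same image-level context guides how every local token is fused, matching the "global understanding guiding local fusion" idea.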

Section 04

Evidence: VHBench-10 Fine-Grained Hallucination Evaluation Benchmark

Dataset Composition

Contains approximately 10,000 samples with a triple structure (I, R, H):

  • I: Input image
  • R: Factually accurate description
  • H: Description with specific hallucinations
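The (I, R, H) triple maps naturally onto a small record type. The sketch below uses illustrative field names and an invented example; the benchmark's actual schema may differ.

```python
from dataclasses import dataclass

@dataclass
class VHBenchSample:
    """One (I, R, H) triple; field names are illustrative only."""
    image_path: str    # I: input image
    reference: str     # R: factually accurate description
    hallucinated: str  # H: description with an injected hallucination
    category: str      # one of the 10 fine-grained subclasses

sample = VHBenchSample(
    image_path="images/0001.jpg",
    reference="A red apple on a wooden table.",
    hallucinated="A red apple and a banana on a wooden table.",
    category="object counting",
)
print(sample.category)  # object counting
```

Pairing R and H for the same image lets an evaluator test whether a model prefers the faithful description over the hallucinated one.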

Ten Hallucination Categories

Divided into 4 dimensions and 10 subclasses:

  • Detection: color recognition, shape recognition
  • Segmentation: object counting, attribute description
  • Localization: relative position, absolute position
  • Classification: object recognition, text recognition, scene understanding, action recognition
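For per-category scoring, the taxonomy can be held as a simple mapping. This is a sketch built from the categories listed above; the benchmark's own field names may differ.

```python
# 4 dimensions -> 10 subclasses, as described in the text.
VHBENCH_TAXONOMY = {
    "detection": ["color recognition", "shape recognition"],
    "segmentation": ["object counting", "attribute description"],
    "localization": ["relative position", "absolute position"],
    "classification": ["object recognition", "text recognition",
                       "scene understanding", "action recognition"],
}

# Flatten to the 10 fine-grained subclasses.
subclasses = [s for subs in VHBENCH_TAXONOMY.values() for s in subs]
print(len(VHBENCH_TAXONOMY), len(subclasses))  # 4 10
```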

Data Generation

Hallucination descriptions are generated by GPT-4o: prompt engineering targets each subclass individually, and hallucinations are injected in a controlled way so that model defects can be precisely localized.
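A controlled-injection prompt might look like the template below. This is a hypothetical illustration; the paper's actual prompts are not reproduced in this summary.

```python
# Hypothetical prompt template for subclass-targeted hallucination injection.
def build_injection_prompt(reference: str, subclass: str) -> str:
    """Ask the generator to alter exactly one fact of a given type."""
    return (
        "Rewrite the following image description so that it contains "
        f"exactly one hallucination of type '{subclass}', changing "
        "nothing else.\n"
        f"Description: {reference}"
    )

prompt = build_injection_prompt("Two cats sleep on a sofa.", "object counting")
print(prompt)
```

Constraining the model to a single, typed alteration is what makes the resulting H descriptions usable as fine-grained diagnostics.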

Section 05

Evidence: Technical Implementation and Experimental Setup

The implementation is based on the LLaVA-1.5 architecture, supports the Qwen and LLaMA series of language models, and open-sources the training and inference code.

Environment Configuration

  • Python 3.12
  • PyTorch 2.9.1 / torchvision 0.24.1
  • Transformers 4.57.3
  • DeepSpeed 0.15.4 (distributed training)

Training Process

The repository provides pre-training and fine-tuning scripts and supports Qwen 3B and LLaMA 3B models; users can run them after updating the configuration files (data, model, and output paths).

Section 06

Research Significance and Implications

  1. Value of visual-side optimization: demonstrates that optimizing at the source of visual feature extraction is effective, compensating for the limitations of traditional language-side approaches.
  2. Potential of multi-expert architectures: the success of dynamic routing and multi-expert fusion in cross-modal tasks extends Mixture-of-Experts (MoE) ideas.
  3. Necessity of fine-grained evaluation: VHBench-10's ten-category taxonomy provides a systematic evaluation framework that enables targeted improvements.
  4. Power of open-source collaboration: integrating open-source encoders such as CLIP and DINOv2 reflects community-driven innovation.
Section 07

Summary and Outlook

As a work accepted by EMNLP 2025 Findings, VisionWeaver provides a novel and effective solution to alleviate LVLM hallucinations, improves accuracy through multi-expert feature aggregation, and offers a new perspective on understanding the visual roots of hallucinations.

VHBench-10 provides the community with a fine-grained evaluation tool to promote systematic research. As LVLMs are applied in fields such as medical care and autonomous driving, solving the hallucination problem becomes increasingly important. VisionWeaver's ideas and open-source implementation will provide references for future exploration.