
VisionWeaver: Addressing Hallucination in Multimodal Large Models from the Visual Encoder Perspective

A study accepted to the Findings of EMNLP 2025 that proposes alleviating object hallucination in large vision-language models by dynamically aggregating features from multiple specialized visual encoders, released together with the VHBench-10 fine-grained evaluation benchmark.

Tags: vision-language models · object hallucination · multi-expert architecture · dynamic routing · VHBench-10 · EMNLP 2025 · CLIP · DINOv2 · SAM · multimodal learning
Published 2026-04-09 15:39 · Recent activity 2026-04-09 15:45 · Estimated read: 8 min
Section 01

VisionWeaver: Addressing Hallucination in Multimodal Large Models from the Visual Encoder Perspective (Opening Post)

VisionWeaver, a study accepted by EMNLP 2025 Findings, proposes to alleviate object hallucination in large vision-language models by dynamically aggregating features from multiple specialized visual encoders, and releases the VHBench-10 fine-grained evaluation benchmark as a companion. The core idea is to optimize from the source of visual feature extraction, using a multi-expert architecture and dynamic routing mechanism to reduce hallucinations.

Section 02

Background: The Hallucination Dilemma of Vision-Language Models

Large vision-language models (LVLMs) have made significant progress in image understanding and generation tasks, but object hallucination (describing objects or attributes that are not present) seriously undermines their reliability. Traditional solutions focus on the language-decoding side (data quality, decoding strategies, post-processing) and fail to address the root cause. The VisionWeaver team hypothesizes that different visual encoders carry different inductive biases and therefore exhibit different hallucination patterns, so mitigation should start at the source: visual feature extraction.

Section 03

Method: Core Innovations of VisionWeaver

Multi-Expert Visual Encoder Architecture

Instead of relying on a single encoder, VisionWeaver integrates multiple specialized experts:

  • CLIP: Main encoder, providing global visual understanding
  • DINOv2: Self-supervised fine-grained feature learning
  • SAM: Segmentation capability, locating object boundaries
  • Vary: Document and text image understanding
  • ConvNext and EVA-02: Complementary visual representations
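Before fusion, the heterogeneous expert outputs have to live in a common feature space. A minimal PyTorch sketch of that alignment step is below; the per-expert feature widths, token count, and shared dimension are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Each expert produces features of its own width; a per-expert linear
# projection maps them into one shared space so they can later be fused.
# All dimensions here are hypothetical.
expert_dims = {"clip": 1024, "dinov2": 768, "sam": 256,
               "vary": 1024, "convnext": 1536, "eva02": 1024}
shared_dim = 1024
projections = nn.ModuleDict(
    {name: nn.Linear(dim, shared_dim) for name, dim in expert_dims.items()}
)

# Stand-in per-expert token features for a batch of 2 images, 16 tokens each.
raw = {name: torch.randn(2, 16, dim) for name, dim in expert_dims.items()}
# Stack into a single (batch, experts, tokens, shared_dim) tensor.
aligned = torch.stack([projections[n](f) for n, f in raw.items()], dim=1)
print(aligned.shape)  # torch.Size([2, 6, 16, 1024])
```

The stacked tensor gives the routing network one uniform view of all six experts.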

Dynamic Routing Mechanism

The router uses CLIP's [CLS] token to generate routing signals and fuses the expert features with the resulting weights, achieving:

  1. Adaptive selection of expert combinations (based on image type)
  2. Global understanding guiding local fusion
  3. Reducing single encoder bias

The core is a context-aware routing network that intelligently aggregates the advantages of multiple experts.
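The routing step described above can be sketched as a small gating MLP over the CLIP [CLS] embedding whose softmax output weights a sum over expert features. This is a minimal sketch under assumed layer sizes, not the paper's actual network.

```python
import torch
import torch.nn as nn

class ContextAwareRouter(nn.Module):
    """Sketch of a context-aware routing network (hypothetical sizes).

    The CLIP [CLS] token summarizes the whole image; a small MLP maps it
    to one weight per expert, and expert features are fused as a
    weighted sum over the expert axis.
    """

    def __init__(self, cls_dim: int, num_experts: int, hidden: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(cls_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_experts),
        )

    def forward(self, cls_token: torch.Tensor, expert_feats: torch.Tensor):
        # cls_token:    (batch, cls_dim)              global CLIP summary
        # expert_feats: (batch, experts, tokens, dim) per-expert features
        weights = self.gate(cls_token).softmax(dim=-1)        # (batch, experts)
        fused = (weights[:, :, None, None] * expert_feats).sum(dim=1)
        return fused, weights

router = ContextAwareRouter(cls_dim=768, num_experts=6)
cls = torch.randn(2, 768)            # stand-in CLIP [CLS] embeddings
feats = torch.randn(2, 6, 16, 1024)  # 6 experts, 16 tokens, width 1024
fused, w = router(cls, feats)
print(fused.shape, w.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 6])
```

Because the gate conditions only on the global [CLS] summary, the same image-level context guides how every local token is fused, matching the "global understanding guiding local fusion" idea.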

Section 04

Evidence: VHBench-10 Fine-Grained Hallucination Evaluation Benchmark

Dataset Composition

Contains approximately 10,000 samples with a triple structure (I, R, H):

  • I: Input image
  • R: Factually accurate description
  • H: Description with specific hallucinations
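The (I, R, H) triple maps naturally onto a small record type. The sketch below uses illustrative field names and an invented example; the benchmark's actual schema may differ.

```python
from dataclasses import dataclass

@dataclass
class VHBenchSample:
    """One (I, R, H) triple; field names are illustrative only."""
    image_path: str    # I: input image
    reference: str     # R: factually accurate description
    hallucinated: str  # H: description with an injected hallucination
    category: str      # one of the 10 fine-grained subclasses

sample = VHBenchSample(
    image_path="images/0001.jpg",
    reference="A red apple on a wooden table.",
    hallucinated="A red apple and a banana on a wooden table.",
    category="object counting",
)
print(sample.category)  # object counting
```

Pairing R and H for the same image lets an evaluator test whether a model prefers the faithful description over the hallucinated one.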

Ten Hallucination Categories

Divided into 4 dimensions and 10 subclasses:

  • Detection: color recognition, shape recognition
  • Segmentation: object counting, attribute description
  • Localization: relative position, absolute position
  • Classification: object recognition, text recognition, scene understanding, action recognition
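For per-category scoring, the taxonomy can be held as a simple mapping. This is a sketch built from the categories listed above; the benchmark's own field names may differ.

```python
# 4 dimensions -> 10 subclasses, as described in the text.
VHBENCH_TAXONOMY = {
    "detection": ["color recognition", "shape recognition"],
    "segmentation": ["object counting", "attribute description"],
    "localization": ["relative position", "absolute position"],
    "classification": ["object recognition", "text recognition",
                       "scene understanding", "action recognition"],
}

# Flatten to the 10 fine-grained subclasses.
subclasses = [s for subs in VHBENCH_TAXONOMY.values() for s in subs]
print(len(VHBENCH_TAXONOMY), len(subclasses))  # 4 10
```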

Data Generation

Hallucination descriptions are generated by GPT-4o: prompt engineering targets each subclass individually, and hallucinations are injected in a controlled way so that model defects can be precisely localized.
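A controlled-injection prompt might look like the template below. This is a hypothetical illustration; the paper's actual prompts are not reproduced in this summary.

```python
# Hypothetical prompt template for subclass-targeted hallucination injection.
def build_injection_prompt(reference: str, subclass: str) -> str:
    """Ask the generator to alter exactly one fact of a given type."""
    return (
        "Rewrite the following image description so that it contains "
        f"exactly one hallucination of type '{subclass}', changing "
        "nothing else.\n"
        f"Description: {reference}"
    )

prompt = build_injection_prompt("Two cats sleep on a sofa.", "object counting")
print(prompt)
```

Constraining the model to a single, typed alteration is what makes the resulting H descriptions usable as fine-grained diagnostics.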

Section 05

Evidence: Technical Implementation and Experimental Setup

The implementation is based on the LLaVA-1.5 architecture, supports the Qwen and LLaMA series of language models, and open-sources the training and inference code.

Environment Configuration

  • Python 3.12
  • PyTorch 2.9.1 / torchvision 0.24.1
  • Transformers 4.57.3
  • DeepSpeed 0.15.4 (distributed training)

Training Process

The repository provides pre-training and fine-tuning scripts and supports Qwen 3B and LLaMA 3B models; users can run them after updating the configuration files (data, model, and output paths).

Section 06

Research Significance and Implications

  1. Value of visual-side optimization: demonstrates that optimizing at the source of visual feature extraction is effective, compensating for the limitations of traditional language-side approaches.
  2. Potential of multi-expert architectures: the success of dynamic routing and multi-expert fusion in cross-modal tasks extends Mixture-of-Experts (MoE) ideas.
  3. Necessity of fine-grained evaluation: VHBench-10's ten-category taxonomy provides a systematic evaluation framework that enables targeted improvements.
  4. Power of open-source collaboration: integrating open-source encoders such as CLIP and DINOv2 reflects community-driven innovation.
Section 07

Summary and Outlook

As a work accepted by EMNLP 2025 Findings, VisionWeaver provides a novel and effective solution to alleviate LVLM hallucinations, improves accuracy through multi-expert feature aggregation, and offers a new perspective on understanding the visual roots of hallucinations.

VHBench-10 provides the community with a fine-grained evaluation tool to promote systematic research. As LVLMs are applied in fields such as medical care and autonomous driving, solving the hallucination problem becomes increasingly important. VisionWeaver's ideas and open-source implementation will provide references for future exploration.