M³-VQA: A New Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

M³-VQA is a new knowledge-based visual question answering benchmark that focuses on fine-grained multimodal entity understanding and complex multi-hop reasoning, filling the gap in existing VQA datasets around multi-entity reasoning.

Tags: Visual Question Answering · Multimodal · Multi-hop Reasoning · Benchmark · Large Language Models · Knowledge Retrieval · Entity Understanding
Published 2026-04-28 09:57 · Recent activity 2026-04-29 12:31 · Estimated read: 8 min

Section 01

Introduction: Core Overview of the M³-VQA New Benchmark

M³-VQA is a knowledge-based visual question answering benchmark designed for Multimodal Large Language Models (MLLMs). It focuses on fine-grained multi-entity understanding and complex multi-hop reasoning, filling the gap in existing VQA datasets around multi-entity reasoning. This article introduces the benchmark along several dimensions: research background, dataset design, evaluation framework, key findings, contributions and limitations, and application implications, showing how it provides a more rigorous testing platform for MLLM research.


Section 02

Research Background: Limitations of Existing VQA Benchmarks and Real-World Needs

Limitations of Existing VQA Benchmarks

Existing visual question answering benchmarks generally share three problems:

  1. Coarse-grained categories: They focus on macro-level category recognition and lack fine-grained entity features and relationship understanding;
  2. Single-entity reasoning: Questions revolve around a single entity, so multi-entity processing capabilities cannot be evaluated;
  3. No knowledge integration: Answers rely on the image alone and require no external knowledge or cross-document reasoning, which is disconnected from real scenarios.

Complexity of the Real World

Real-world questions often involve relationships among multiple entities, cross-modal information integration, and multi-step reasoning; M³-VQA is designed to fill this evaluation gap.


Section 03

Dataset Design: Core Features and Construction Process of Multimodal, Multi-Entity, Multi-Hop

Three Core Features

  • Multimodal: Needs to understand both visual and textual information, integrating evidence from different modalities;
  • Multi-entity: Questions involve multiple entities, requiring identification and understanding of relationships between entities;
  • Multi-hop: Requires sequential or parallel multi-step reasoning.

Key Design Details

  • Traceable evidence: Each question is annotated with the evidence fragments, their sources, and the reasoning-chain steps needed to reach the answer (a hypothetical record sketch follows this list);
  • Multimodal knowledge base: Includes image background knowledge, cross-document associations, and semantic relationships between entities.
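
The exact annotation schema is not reproduced here, but a minimal sketch of what a traceable multi-entity, multi-hop record could look like (in Python, with all field names and values hypothetical) is:

```python
# Hypothetical M³-VQA-style record; field names and values are illustrative,
# not the benchmark's actual schema.
sample = {
    "question": "Which of the two towers shown was completed first?",
    "image_id": "img_00421",
    "entities": ["Eiffel Tower", "Tokyo Tower"],               # multi-entity
    "answer": "Eiffel Tower",
    "reasoning_chain": [                                        # multi-hop, traceable
        {"step": 1, "hop": "identify the left tower",
         "evidence": {"modality": "image", "source": "img_00421"}},
        {"step": 2, "hop": "identify the right tower",
         "evidence": {"modality": "image", "source": "img_00421"}},
        {"step": 3, "hop": "compare completion years (1889 vs. 1958)",
         "evidence": {"modality": "text", "source": "kb/doc_0173"}},
    ],
}
```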

Construction Process

  1. Candidate question generation → 2. Multi-entity constraint check → 3. Multi-hop reasoning verification → 4. Evidence annotation → 5. Quality review
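
Read as code, the pipeline is a sequence of filters over candidate questions. The sketch below is an assumption about how such filtering could be organized; annotate_evidence and passes_quality_review stand in for steps that are at least partly manual in practice.

```python
def annotate_evidence(question):
    # Placeholder: in practice, evidence fragments and their sources are attached here.
    return question

def passes_quality_review(question):
    # Placeholder: in practice, this is a human review step.
    return True

def build_dataset(candidates, min_entities=2, min_hops=2):
    """Filter candidate questions into the final dataset (illustrative sketch)."""
    dataset = []
    for q in candidates:                              # 1. candidate question generation (upstream)
        if len(q["entities"]) < min_entities:         # 2. multi-entity constraint check
            continue
        if len(q["reasoning_chain"]) < min_hops:      # 3. multi-hop reasoning verification
            continue
        q = annotate_evidence(q)                      # 4. evidence annotation
        if passes_quality_review(q):                  # 5. quality review
            dataset.append(q)
    return dataset
```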

Diversity Coverage

Covers diversity in entity types, modality combinations, reasoning types, and domain distribution.


Section 04

Evaluation Framework and Key Findings: Complex Reasoning Performance of MLLMs

Three Evaluation Settings

  1. Without external knowledge: The model relies only on its internal knowledge, testing basic reasoning capabilities;
  2. Gold evidence: Manually annotated evidence is provided, isolating the impact of retrieval;
  3. Retrieval-augmented: The model retrieves information from the knowledge base on its own, simulating real-world use (a minimal harness sketch follows this list).
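
A minimal harness for the three settings might look like the sketch below. The model.answer(question, image, evidence) and retriever.search(...) interfaces, and exact-match scoring, are assumptions for illustration rather than the benchmark's released evaluation code.

```python
def evaluate_sample(model, sample, setting, retriever=None):
    """Run one M³-VQA-style sample under one of the three evaluation settings.
    The model/retriever interfaces are assumed for illustration."""
    question, image = sample["question"], sample["image_id"]

    if setting == "no_external_knowledge":
        evidence = None                                   # internal knowledge only
    elif setting == "gold_evidence":
        evidence = sample["reasoning_chain"]              # human-annotated evidence
    elif setting == "retrieval_augmented":
        evidence = retriever.search(question, top_k=5)    # system-driven retrieval
    else:
        raise ValueError(f"unknown setting: {setting}")

    prediction = model.answer(question, image, evidence)
    return prediction == sample["answer"]                 # exact match as a stand-in metric
```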

Key Findings

  1. Weak performance without external knowledge: Models struggle with fine-grained entity recognition, cross-modal alignment, and long-range reasoning;
  2. Gold evidence brings large gains: The bottleneck lies in information retrieval rather than in the reasoning itself;
  3. Reasoning-aware retrieval wins: Retrieval strategies that adapt to the reasoning steps outperform heuristic methods (the two are contrasted in the sketch below).
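
Finding 3 contrasts one-shot, keyword-style retrieval with retrieval that follows the reasoning steps. A rough sketch of the difference, again with hypothetical retriever and planner interfaces:

```python
def heuristic_retrieval(retriever, question, top_k=5):
    # One-shot: a single keyword-style query over the full question.
    return retriever.search(question, top_k=top_k)

def reasoning_aware_retrieval(retriever, planner, question, top_k=3, max_hops=4):
    """Retrieve evidence hop by hop: each new query is planned from the
    question plus the evidence gathered so far."""
    evidence = []
    for _ in range(max_hops):
        subquery = planner.next_query(question, evidence)   # None when the plan is complete
        if subquery is None:
            break
        evidence.extend(retriever.search(subquery, top_k=top_k))
    return evidence
```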

Section 05

Contributions and Limitations: Value to the Research Community and Future Directions

Community Contributions

  • Sets stricter evaluation standards for MLLMs;
  • Promotes multimodal reasoning research and interpretability exploration;
  • Serves as a testbed for Retrieval-Augmented Generation (RAG) systems.

Current Limitations

  • Language limitation: Mainly in English;
  • Domain coverage: Insufficient in professional fields (e.g., medical imaging);
  • Dynamic reasoning: Lack of multi-turn interaction scenarios.

Future Extensions

Multilingual versions, video understanding, interactive evaluation, and open-ended generation tasks.


Section 06

Application Implications and Summary: Model Development and Deployment Strategies

Recommendations for Developers

  • Invest in retrieval modules: Prioritize improving information acquisition capabilities;
  • Strengthen multimodal pre-training: Increase the proportion of multi-entity data;
  • Explicitly model reasoning chains: Prefer structured, step-by-step reasoning over implicit end-to-end learning (one possible output contract is sketched below).
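
As an illustration of the third point, one possible (purely hypothetical) output contract makes the chain explicit: the model returns each step together with the evidence it used, so the chain can be checked against the benchmark's annotations.

```python
import json

# Hypothetical instruction for a structured reasoning chain; not a prompt
# prescribed by the M³-VQA paper.
REASONING_PROMPT = (
    "Answer the question about the image. Return JSON with two fields: "
    '"steps", a list of {"claim": ..., "evidence_source": ...}, and '
    '"answer", the final short answer.'
)

def parse_reasoning(raw_output: str) -> dict:
    """Parse and minimally validate a structured reasoning chain."""
    parsed = json.loads(raw_output)
    if "steps" not in parsed or "answer" not in parsed:
        raise ValueError("model output is missing reasoning fields")
    return parsed
```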

Recommendations for Deployers

  • Combine retrieval strategies: Integrate keyword matching with reasoning-aware retrieval;
  • Evidence verification mechanism: Ensure every answer is backed by a reliable source;
  • Human-machine collaboration: Give complex questions final verification by humans (a deployment-style sketch follows this list).
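
Putting the three recommendations together, a deployment-style wrapper could look like the sketch below. It reuses the reasoning_aware_retrieval sketch from Section 04; every interface, check, and threshold here (including a model that also reports a confidence score) is an assumption rather than a prescribed design.

```python
def answer_with_safeguards(question, image, model, keyword_retriever,
                           reasoning_retriever, planner, confidence_threshold=0.7):
    """Hybrid retrieval + evidence verification + human fallback (illustrative only)."""
    # Combine retrieval strategies: keyword matching plus reasoning-aware retrieval.
    evidence = keyword_retriever.search(question, top_k=5)
    evidence += reasoning_aware_retrieval(reasoning_retriever, planner, question)

    answer, confidence = model.answer(question, image, evidence)

    # Evidence verification: require at least one retrieved fragment that
    # mentions the predicted answer before returning it automatically.
    if not any(answer.lower() in str(fragment).lower() for fragment in evidence):
        return {"answer": None, "status": "rejected: no supporting evidence"}

    # Human-machine collaboration: low-confidence or complex cases go to a reviewer.
    if confidence < confidence_threshold:
        return {"answer": answer, "status": "queued for human review"}
    return {"answer": answer, "status": "auto-approved"}
```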

Summary

M³-VQA raises the bar for VQA benchmarks: it exposes the bottlenecks of current MLLMs in complex reasoning and points to directions for future research, namely smarter retrieval, stronger reasoning mechanisms, and better cross-modal integration.