New Benchmark for Cross-Document Multi-Entity QA: In-Depth Analysis of the MEBench Evaluation Framework

This article introduces MEBench, a benchmark accepted to the EMNLP 2025 main conference and designed to evaluate how well large language models answer questions that span multiple documents and multiple entities.

Tags: Large language models, Cross-document QA, Multi-entity reasoning, Benchmark, EMNLP, Information retrieval, RAG
Published 2026-05-20 17:04 | Recent activity 2026-05-20 17:20 | Estimated read 6 min

Section 01

Introduction: In-Depth Analysis of MEBench, a New Benchmark for Cross-Document Multi-Entity QA

MEBench is a benchmark framework, accepted to the EMNLP 2025 main conference, for evaluating large language models on cross-document multi-entity QA. It targets the reasoning challenges that arise when relevant information is scattered across documents, as it typically is in real-world scenarios. This article covers its dataset construction, evaluation metrics, and experimental results, with the aim of clarifying what large models can and cannot do in complex information integration tasks.

Section 02

Background: Reasoning Challenges in Cross-Document Multi-Entity QA

Large language models have reached near-human levels in single-document reading comprehension, but real-world QA often requires cross-document reasoning. For example, a question comparing Tesla's and BYD's 2024 R&D investment and market share requires extracting information about each company from multiple documents and then combining it into a single analysis. Such cross-document multi-entity QA tasks place higher demands on a model's information integration and reasoning capabilities.

Section 03

MEBench Design and Dataset Construction Methods

MEBench has four core objectives:

  1. Authenticity: using real documents
  2. Complexity: cross-document reasoning and multi-entity comparison
  3. Scalability: supporting different domains and difficulty levels
  4. Interpretability: detailed evaluation metrics and error analysis

Dataset construction process: Document collection (Wikipedia, news, academic literature) → Entity recognition → Relation extraction → Question generation.

Question types include factual, comparative, causal, and inferential. Difficulty is divided into four levels (Level 1 to Level 4), ranging from single-document extraction to complex comprehensive analysis.
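
To make the question types and difficulty levels concrete, here is a minimal sketch of how a single benchmark item could be represented. The class and field names (`BenchmarkItem`, `evidence_doc_ids`, and so on) and the document IDs are assumptions made for illustration; this summary does not describe MEBench's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class QuestionType(Enum):
    # The four question types named in the article.
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    CAUSAL = "causal"
    INFERENTIAL = "inferential"


@dataclass
class BenchmarkItem:
    """Hypothetical record layout, not MEBench's real data format."""
    question: str
    answer: str
    question_type: QuestionType
    level: int  # 1 = single-document extraction ... 4 = complex analysis
    entities: list[str] = field(default_factory=list)
    evidence_doc_ids: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The article describes exactly four difficulty levels.
        if not 1 <= self.level <= 4:
            raise ValueError(f"level must be in 1..4, got {self.level}")

    @property
    def is_cross_document(self) -> bool:
        # True when the evidence is spread over more than one document.
        return len(set(self.evidence_doc_ids)) > 1


# Illustrative item; the answer and document IDs are placeholders.
item = BenchmarkItem(
    question="Which company invested more in R&D in 2024, Tesla or BYD?",
    answer="<gold answer>",
    question_type=QuestionType.COMPARATIVE,
    level=3,
    entities=["Tesla", "BYD"],
    evidence_doc_ids=["doc_tesla_2024", "doc_byd_2024"],
)
print(item.is_cross_document)
```

Keeping the evidence document IDs on each item is a design choice that would let an evaluator score retrieval separately from answer quality.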

Section 04

MEBench Evaluation Metric System

MEBench evaluates models along three dimensions:

  1. Answer accuracy: Exact match, F1 score, semantic similarity
  2. Evidence recall: Document recall rate, evidence completeness, noise filtering
  3. Reasoning quality: Reasoning chain completeness, logical consistency, hallucination detection
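
As an illustration of the answer accuracy metrics, the sketch below computes exact match and token-level F1 using the normalization commonly applied in extractive QA evaluation (lowercasing, stripping punctuation and English articles). This is an assumption about how such metrics are typically computed; the summary does not specify MEBench's exact normalization or scoring rules, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))  # True
print(token_f1("BYD and Tesla", "Tesla"))                # 0.5
```

In the second call the prediction contains the gold token plus two extra tokens, so precision is 1/3, recall is 1, and F1 is 0.5. Partial credit of this kind matters in multi-entity questions, where a model may name some but not all of the required entities.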

Section 05

Experimental Results and Key Findings

Evaluating mainstream models (GPT-4, Claude, Llama, Qwen, etc.) yields four key findings:

  1. Cross-document reasoning remains a challenge: accuracy drops by 15-25% compared to single-document tasks
  2. Long-context capability is a double-edged sword: it improves performance but makes models prone to information overload
  3. RAG methods show clear advantages, but retrieval quality determines performance
  4. Instruction tuning improves format compliance, but brings only limited gains in core reasoning
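
The dependence of RAG performance on retrieval quality can be measured with a document recall score: the fraction of gold evidence documents that appear among the top-k retrieved documents. The sketch below is one plausible formulation under that assumption; the function name, the `k` parameter, and the document IDs are illustrative, and MEBench's own definition may differ.

```python
from typing import Optional, Sequence


def document_recall(
    retrieved: Sequence[str],
    gold: Sequence[str],
    k: Optional[int] = None,
) -> float:
    """Fraction of gold evidence documents found among the top-k retrieved ones.

    `retrieved` is assumed to be ordered from most to least relevant.
    When k is None, the full retrieved list is used.
    """
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold must contain at least one document id")
    top = retrieved if k is None else retrieved[:k]
    return len(gold_set & set(top)) / len(gold_set)


# Hypothetical retrieval result for a two-entity question.
retrieved_ids = ["d3", "d1", "d7", "d2"]
gold_ids = ["d1", "d2"]

print(document_recall(retrieved_ids, gold_ids, k=2))  # 0.5
print(document_recall(retrieved_ids, gold_ids, k=4))  # 1.0
```

With k=2 only one of the two evidence documents is retrieved, so a multi-entity comparison cannot be answered from the retrieved context alone, however strong the downstream model is.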

Section 06

Application Value and Impact of MEBench

Academic value: Provides a standardized evaluation platform for fair model comparison, tracking domain progress, and identifying research directions. Industrial applications: Scenarios such as enterprise knowledge management, financial analysis, legal research, and medical diagnosis. Model development: Provides performance benchmarks, error analysis to guide improvements, and progressive training objectives.

Section 07

Limitations and Future Work Directions

Current limitations: Domain coverage (mainly general, few professional domains), language restrictions (English-dominated), difficulty in dynamic updates. Future directions: Expand professional domains, add multilingual support, develop dynamic update mechanisms, explore multimodal cross-document QA, and establish human-machine collaborative evaluation models.