MuDABench: A New Benchmark for Large-Scale Document Analysis QA, Revealing Bottlenecks of RAG Systems

The new MuDABench benchmark spans 80,000 pages of documents and 332 analytical QA pairs, exposing the limits of existing RAG systems on large-scale cross-document reasoning.

Tags: RAG · Multi-Document QA · Benchmarking · Information Extraction · Agent Workflow · Document Intelligence · Financial AI
Published 2026-04-24 13:28 · Recent activity 2026-04-27 10:59 · Estimated read 10 min

Section 01

MuDABench: A New Large-Scale Document Analysis QA Benchmark Revealing RAG System Bottlenecks

MuDABench is a new analytical QA benchmark over large-scale semi-structured document collections, comprising 80,000 pages of documents and 332 analytical QA instances. It fills a gap left by existing multi-document QA benchmarks, which demand little cross-document reasoning. Using the benchmark, the study exposes the bottlenecks of standard RAG systems and proposes optimization directions such as multi-agent workflows, offering guidance for the design of next-generation RAG systems.

Section 02

Background: Limitations of Existing Multi-Document QA Benchmarks and Real-Scenario Needs

New Challenges in Multi-Document QA

Retrieval-Augmented Generation (RAG) lets large language models answer questions grounded in external documents, but existing multi-document QA benchmarks usually require extracting information from only a handful of documents and demand little cross-document reasoning. Real-world applications such as financial analysis and legal research look very different: analysts must process thousands of pages and perform complex cross-document information integration and quantitative analysis. MuDABench was built to close this gap.

Section 03

Unique Design of MuDABench

Scale and Complexity

MuDABench embodies the "real scenario" design philosophy:

  • 80,000+ pages of documents: Far exceeding the scale of existing benchmarks
  • 332 analytical QA instances: Each question requires complex cross-document reasoning
  • Real financial-domain data: Built from document-level metadata and annotated financial databases

Nature of Analytical QA

Unlike traditional QA, MuDABench's questions require:

  1. Information extraction: Locate relevant information from multiple documents
  2. Information synthesis: Integrate scattered information into a coherent understanding
  3. Quantitative analysis: Perform computational reasoning based on extracted data
  4. Conclusion generation: Form structured analytical answers

This design is closer to real scenarios such as business analysis and investment research.
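
To make the four-step pattern concrete, here is a minimal Python sketch of how one such question might decompose; the question, company names, and figures are purely illustrative, not drawn from the benchmark.

```python
# Hypothetical question: "Which company grew revenue faster from
# FY2022 to FY2023?" -- all names and figures below are made up.

# 1. Information extraction: figures located in separate filings
extracted = {
    "AlphaCo": {"fy2022": 1200.0, "fy2023": 1500.0},  # revenue, $M
    "BetaInc": {"fy2022": 900.0, "fy2023": 1260.0},
}

# 2. Information synthesis + 3. quantitative analysis: align the
# figures into one comparable view and compute year-over-year growth
growth = {
    name: (figs["fy2023"] - figs["fy2022"]) / figs["fy2022"]
    for name, figs in extracted.items()
}

# 4. Conclusion generation: a structured analytical answer
winner = max(growth, key=growth.get)
print(f"{winner} grew faster: {growth[winner]:.1%} YoY")  # BetaInc, 40.0%
```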

Section 04

Innovative Evaluation Protocol

The research team proposes two complementary evaluation metrics:

Final Answer Accuracy

Measures how closely the model-generated answer matches the reference answer; this is the traditional end-to-end evaluation.

Intermediate Fact Coverage

As an auxiliary diagnostic signal, it evaluates whether the model correctly identifies and uses the key intermediate facts along the reasoning path. It helps distinguish answers grounded in correct reasoning from lucky guesses, localize where in the pipeline the model goes wrong, and verify the integrity of the reasoning chain, pointing to concrete directions for system optimization.
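
As a rough illustration, the two metrics could be scored along the following lines; the instance schema, field names, and matching heuristics are assumptions made for this sketch, not the benchmark's official scorer.

```python
from dataclasses import dataclass

@dataclass
class QAInstance:
    question: str
    reference_answer: str
    intermediate_facts: list[str]  # annotated key facts (assumed schema)

def final_answer_accuracy(predicted: str, inst: QAInstance) -> float:
    """End-to-end metric: does the generated answer match the reference?
    Exact string match keeps the sketch short; real scorers typically
    use normalized or numeric matching."""
    return float(predicted.strip().lower() == inst.reference_answer.strip().lower())

def intermediate_fact_coverage(trace: str, inst: QAInstance) -> float:
    """Diagnostic metric: the fraction of annotated intermediate facts
    that surface in the model's reasoning trace, separating 'right for
    the right reasons' from lucky guesses."""
    if not inst.intermediate_facts:
        return 0.0
    low = trace.lower()
    hits = sum(fact.lower() in low for fact in inst.intermediate_facts)
    return hits / len(inst.intermediate_facts)
```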

Section 05

Experimental Findings: Limitations of Standard RAG and Breakthroughs with Multi-Agent

Problems with Flat Retrieval Pool

Experiments show that standard RAG systems, which treat the large-scale collection as a "flat retrieval pool", perform poorly: they suffer from retrieval noise, context fragmentation, and missing cross-document relationships.

Breakthrough of Multi-Agent Workflow

To overcome these limitations, the study proposes a multi-agent workflow that coordinates three modules (a skeleton is sketched below):

  1. Planning module: Analyze the problem and formulate information collection strategies
  2. Extraction module: Precisely extract structured information from target documents
  3. Code generation module: Convert data into executable code for quantitative analysis

This architecture significantly improves both metrics, though a clear gap to human experts remains.
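
The paper's exact agent interfaces are not reproduced here; the following Python skeleton only illustrates how a planning / extraction / code-generation loop might be wired together, with every function name and signature an assumption.

```python
# Hypothetical skeleton of the three-module workflow; the interfaces
# below are assumptions, not MuDABench's reference implementation.

def plan(question: str) -> list[str]:
    """Planning module: break the question into targeted
    information-collection subtasks (which filings, which metrics)."""
    ...  # typically an LLM call that returns a list of subtasks

def extract(subtask: str, corpus: list[dict]) -> dict:
    """Extraction module: retrieve the documents relevant to one
    subtask and pull out structured fields (numbers, entities, cells)."""
    ...  # retrieval plus structured extraction, scoped to few documents

def generate_and_run_code(question: str, facts: list[dict]) -> str:
    """Code-generation module: turn the extracted facts into executable
    analysis code (growth rates, rankings) and return its output."""
    ...  # an LLM writes the analysis; a sandbox executes it

def answer(question: str, corpus: list[dict]) -> str:
    """Coordinate the three modules end to end."""
    facts = [extract(task, corpus) for task in plan(question)]
    return generate_and_run_code(question, facts)
```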

Section 06

Two Major Bottlenecks of Current Systems

After in-depth analysis of failure cases, the study identifies two major bottlenecks:

Bottleneck 1: Insufficient Accuracy in Single-Document Information Extraction

Even when the correct document is located, the model often slips: numerical extraction errors (e.g., misreading "150 million" as "1.5 billion"), confused entity relationships, and misaligned table data.
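
Misreadings like "150 million" vs. "1.5 billion" are unit-scaling failures. One common mitigation, sketched below as an assumption rather than something the paper prescribes, is to normalize every extracted figure to a base unit before any arithmetic:

```python
import re

# Scale words mapped to multipliers; extend for "thousand", "bn", etc.
SCALES = {"million": 1e6, "billion": 1e9, "m": 1e6, "b": 1e9}

def normalize_amount(text: str) -> float:
    """Parse '150 million' / '$1.5b' into a plain float so that
    downstream arithmetic never mixes scales."""
    m = re.search(r"([\d,.]+)\s*(million|billion|m|b)?", text.lower())
    if not m:
        raise ValueError(f"no amount found in {text!r}")
    value = float(m.group(1).replace(",", ""))
    return value * SCALES.get(m.group(2), 1.0)

assert normalize_amount("150 million") == 150e6
assert normalize_amount("$1.5b") == 1.5e9  # scales now explicit
```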

Bottleneck 2: Lack of Domain Knowledge

Financial analysis demands deep domain knowledge: accounting terminology, industry-specific rules, and business logic. General-purpose LLMs fall clearly short here and need specialized domain adaptation.

Section 07

Insights for RAG System Design

The MuDABench results point to four design directions:

1. Hierarchical Retrieval Architecture

Abandon the "flat retrieval pool" mindset and build a hierarchical system: top-level document filtering, middle-level chapter positioning, bottom-level precise extraction.
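
A minimal sketch of such a hierarchy follows, with all three levels stubbed out; every helper name below is hypothetical:

```python
# Each stub illustrates one level of the hierarchy; none of these are
# APIs from the MuDABench baseline.

def filter_documents(question: str, corpus: list[dict]) -> list[dict]:
    """Top level: use document metadata (issuer, fiscal year, filing
    type) to narrow 80,000+ pages down to a few candidate documents."""
    ...

def locate_sections(question: str, doc: dict) -> list[str]:
    """Middle level: position within a document, finding the chapters
    or tables likely to contain the answer."""
    ...

def extract_passages(question: str, section: str) -> str:
    """Bottom level: precise extraction of the passage or table cell
    that is actually fed to the reader model."""
    ...

def retrieve_hierarchically(question: str, corpus: list[dict]) -> list[str]:
    """Filter, then position, then extract -- instead of one flat pool."""
    docs = filter_documents(question, corpus)
    sections = [s for d in docs for s in locate_sections(question, d)]
    return [extract_passages(question, s) for s in sections]
```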

2. Structured Information Extraction

Develop specialized extraction modules: parse complex tables and charts, understand document hierarchical structures, maintain entity relationship graphs.
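
As a toy illustration of why structure matters, keeping each cell bound to its column label (rather than flattening tables into raw text) removes one source of the misalignment errors noted under Bottleneck 1; the data here is made up:

```python
def table_to_records(header: list[str], rows: list[list[str]]) -> list[dict]:
    """Bind every cell to its column label so downstream reasoning
    never loses track of which number belongs to which field."""
    return [dict(zip(header, row)) for row in rows]

# Illustrative data, not from the benchmark:
records = table_to_records(
    ["segment", "fy2023_revenue_usd_m"],
    [["Cloud", "812"], ["Hardware", "344"]],
)
# -> [{'segment': 'Cloud', 'fy2023_revenue_usd_m': '812'},
#     {'segment': 'Hardware', 'fy2023_revenue_usd_m': '344'}]
```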

3. Domain Adaptation

Build domain-specific resources: term dictionaries, reasoning rules, and fine-tuning datasets.

4. Human-Machine Collaboration Workflow

Design human-machine collaboration processes: AI performs initial screening and positioning, humans verify key results, AI assists with calculation and reporting, and humans make the final decision.

Section 08

Open Source and Conclusion

Open Source and Community Contribution

MuDABench has been open-sourced on GitHub (https://github.com/Zhanli-Li/MuDABench), providing the large-scale real document collection, high-quality QA annotations, baseline system implementations, and evaluation scripts, and can serve as an experimental platform for research on RAG systems, document intelligence, and related fields.

Conclusion

MuDABench is not only a new benchmark but also a reminder of where RAG technology must go next: as AI moves from demos to production, scale, complexity, and domain expertise are the real tests. Understanding the bottlenecks is the first step toward solving them.