# FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation

> This article introduces the FinDocMRE benchmark, which contains 12,207 samples from 2,878 financial reports. It evaluates the document-level reasoning capabilities of large multimodal models (LMMs) across five task types, and the results show that no model's overall score exceeds 65 points.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T07:18:01.000Z
- Last activity: 2026-05-19T03:30:12.618Z
- Popularity: 126.8
- Keywords: financial AI, multimodal reasoning, benchmarking, document understanding, numerical reasoning, financial reports
- Page link: https://www.zingnex.cn/en/forum/thread/findocmre
- Canonical: https://www.zingnex.cn/forum/thread/findocmre

---

## [Main Floor] FinDocMRE Benchmark: A New Evaluation Standard for Document-Level Financial Multimodal Reasoning

FinDocMRE is a benchmark for evaluating document-level financial multimodal reasoning. It contains 12,207 samples drawn from 2,878 financial reports covering 12 financial domains, and it defines five task types that test the document-level reasoning capabilities of large multimodal models (LMMs). In the reported experiments, no model's overall score exceeds 65 points out of 100, which shows that financial documents remain a significant challenge for current LMMs.

## [Background] Limitations of Existing Financial Benchmarks and the Necessity of FinDocMRE

Financial analysis requires reasoning over complex document-level information. This includes integrating multiple sources (text, tables, images, etc.), linking content across pages (e.g., cross-checking an income statement against its notes), and applying domain expertise. Existing financial benchmarks, however, mostly focus on isolated charts and do not reflect the complexity of document-level reasoning. FinDocMRE aims to fill this gap and to promote the development of multimodal reasoning capabilities in the financial field.

## [Methodology] Construction Process and Data Sources of FinDocMRE

The data comes from 2,878 real financial reports (annual and quarterly reports, prospectuses, etc.). Construction follows a semi-automated process with two stages:

1. Visual-centric generation: reasoning questions and answers are generated automatically, centered on visual elements such as charts and tables.
2. Expert validation: all samples are reviewed by financial experts to ensure accuracy and rationality.

This process balances data scale (12,207 samples) against quality.
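The two-stage flow described above (automatic generation, then expert review) can be pictured as a generate-then-filter pipeline. The following is only a minimal sketch: `CandidateSample`, `build_benchmark`, and the `expert_approves` callback are hypothetical names introduced for illustration, and the toy records are not taken from the benchmark or from the authors' tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CandidateSample:
    """A generated question-answer pair tied to one visual element (hypothetical schema)."""
    report_id: str
    visual_element: str  # e.g. "chart" or "table"
    question: str
    answer: str


def build_benchmark(
    candidates: Iterable[CandidateSample],
    expert_approves: Callable[[CandidateSample], bool],
) -> List[CandidateSample]:
    """Keep only the candidates that pass the review step."""
    return [c for c in candidates if expert_approves(c)]


# Toy stand-ins for both stages; a real review would be done by human experts.
candidates = [
    CandidateSample("report-001", "chart", "Which year shows the highest revenue?", "placeholder"),
    CandidateSample("report-001", "table", "", ""),  # malformed generation, should be rejected
]
accepted = build_benchmark(candidates, lambda c: bool(c.question and c.answer))
print(len(accepted))
```

Passing the review step in as a callback keeps the automatic stage and the human stage separate, which mirrors the split the article describes.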

## [Tasks] Analysis of the Five Evaluation Task Types in FinDocMRE

Five tasks are designed to comprehensively evaluate multimodal reasoning capabilities:

1. Semantic narrative construction: generate coherent textual descriptions based on visual information.
2. Numerical estimation: extract estimated values from charts and tables.
3. Cross-page visual positioning: associate visual information across pages.
4. Multi-image reasoning: process information from multiple images simultaneously.
5. Document-level understanding: fully understand the structure and content of the document.
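For readers who want to organize evaluation code around these five tasks, one possible representation is an enumeration plus a per-sample record. This is a sketch under stated assumptions: the field names (`sample_id`, `reference_answer`, `pages`) and the helper `group_by_task` are hypothetical, and the actual release format of FinDocMRE may differ.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class TaskType(Enum):
    """The five task types named in the article."""
    SEMANTIC_NARRATIVE = "semantic narrative construction"
    NUMERICAL_ESTIMATION = "numerical estimation"
    CROSS_PAGE_POSITIONING = "cross-page visual positioning"
    MULTI_IMAGE_REASONING = "multi-image reasoning"
    DOCUMENT_UNDERSTANDING = "document-level understanding"


@dataclass
class Sample:
    """One benchmark item (hypothetical schema, not the official one)."""
    sample_id: str
    task: TaskType
    question: str
    reference_answer: str
    pages: List[int] = field(default_factory=list)  # pages the question depends on


def group_by_task(samples: List[Sample]) -> Dict[TaskType, List[Sample]]:
    """Bucket samples by task so each task can be scored separately."""
    grouped: Dict[TaskType, List[Sample]] = defaultdict(list)
    for sample in samples:
        grouped[sample.task].append(sample)
    return dict(grouped)


# Illustrative records only; questions and answers are placeholders.
samples = [
    Sample("s1", TaskType.NUMERICAL_ESTIMATION, "Roughly how tall is the last bar?", "placeholder", [4]),
    Sample("s2", TaskType.CROSS_PAGE_POSITIONING, "Which note explains this line item?", "placeholder", [4, 17]),
]
by_task = group_by_task(samples)
```

Recording the pages a question depends on makes it easy to separate single-page items from cross-page ones when diagnosing errors.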

## [Experimental Results] Performance Analysis of Current LMMs in Financial Multimodal Reasoning

An evaluation of 11 representative LMMs found:

1. Overall performance: no model scored above 65 points.
2. Task differentiation: models performed comparatively well on semantic narrative construction, but poorly on numerical estimation and cross-page visual positioning.

This indicates that models are good at "telling stories" but lack precision in numerical reasoning and in associating information across pages.
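The article does not state how answers are scored or how per-task results are combined into the overall score. As an illustration only, the sketch below assumes a relative-tolerance check for numerical answers and an unweighted mean over per-task scores on a 0 to 100 scale. Both `numeric_match` and `overall_score` are hypothetical helpers, and the 5% default tolerance is an arbitrary choice, not the benchmark's metric.

```python
from statistics import mean
from typing import Dict


def numeric_match(predicted: float, reference: float, rel_tol: float = 0.05) -> bool:
    """Return True if the prediction is within rel_tol of the reference (assumed criterion)."""
    if reference == 0:
        # Relative error is undefined at zero, so fall back to an absolute check.
        return abs(predicted) <= rel_tol
    return abs(predicted - reference) / abs(reference) <= rel_tol


def overall_score(per_task: Dict[str, float]) -> float:
    """Unweighted mean of per-task scores on a 0-100 scale (assumed aggregation)."""
    return mean(list(per_task.values()))
```

Under these assumptions, `numeric_match(104.0, 100.0)` returns `True` and `numeric_match(110.0, 100.0)` returns `False`. With an unweighted mean, a weak numerical estimation score lowers the overall score as much as a weak score on any other task.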

## [Implications] Key Guidance from FinDocMRE for the Development of Financial AI

1. Financial AI is still at an early stage, and applying current models directly to document analysis is premature.
2. Numerical reasoning is a core bottleneck that calls for more numerical training samples or specialized module design.
3. Cross-page, document-level understanding needs architectural innovation, because existing architectures struggle to integrate information at this scale.

## [Significance] Industry Value of the FinDocMRE Benchmark

As the first document-level financial multimodal reasoning benchmark, its significance includes:

1. Standardized evaluation: provides a unified platform to support comparisons between models.
2. Precise diagnosis: identifies model strengths and weaknesses through sub-tasks.
3. Direction guidance: clarifies future research goals and directions.
