
OCRBench: Unveiling the Hidden Mysteries of OCR Capabilities in Large Language Models

This article introduces the OCRBench series of benchmarks, including OCRBench, OCRBench v2, and MDPBench, which are used to comprehensively evaluate the capabilities of large multimodal models (LMMs) in text recognition, scene text understanding, document parsing, and other areas.

Tags: OCR · multimodal models · benchmark · text recognition · document parsing · multilingual
Published 2026-04-02 22:14 · Recent activity 2026-04-02 22:24 · Estimated read 6 min

[Introduction] OCRBench Series Benchmarks: Key Tools for Comprehensive Evaluation of OCR Capabilities in Large Language Models

Optical Character Recognition (OCR) technology has been transformed by the rise of Large Multimodal Models (LMMs). Traditional evaluations, however, focus only on character- and word-level accuracy and fail to cover the broader capabilities of LMMs, such as semantic understanding and information extraction. The OCRBench series of benchmarks (the original OCRBench, OCRBench v2, and MDPBench) emerged to fill this gap, providing the research community with a systematic assessment tool and driving progress in the OCR field.


1. Background of OCRBench: Limitations and Needs of Traditional OCR Evaluation

Traditional OCR evaluations focus on character- and word-level recognition accuracy, whereas LMMs possess broader capabilities such as scene text understanding, structured document information extraction, handwritten mathematical expression recognition, and multilingual processing. Existing benchmarks typically cover only a single task and lack comprehensive evaluation. OCRBench aims to fill this gap with a benchmark that spans multiple OCR tasks.


2. Design and Features of Core Versions in the OCRBench Series

  1. Original OCRBench: Comprises five components (text recognition, scene text VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition) with 1,000 manually verified question-answer pairs; accepted by Science China Information Sciences.
  2. OCRBench v2: Offers four times as many tasks as the original, covers 31 scenarios, and contains 10,000 manually verified QA pairs (with a high proportion of hard samples) scored with more fine-grained evaluation metrics; accepted to the NeurIPS 2025 Datasets and Benchmarks Track.
  3. MDPBench: The first multilingual document parsing benchmark, with 3,400 document images covering 17 languages, diverse writing systems, and varied shooting conditions; quality is ensured through a strict annotation process.
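To make concrete how a manually verified QA-pair benchmark like the ones above is typically scored, here is a minimal sketch. The record fields and the containment-style matcher are illustrative assumptions, not the official OCRBench scoring code:

```python
# Hypothetical sketch of scoring one benchmark QA pair.
# Field names and the containment-style matcher are illustrative
# assumptions, not the official OCRBench implementation.
from dataclasses import dataclass

@dataclass
class QAPair:
    image_path: str      # input document/scene image
    question: str        # task prompt shown to the model
    answers: list[str]   # manually verified ground-truth answers

def is_correct(prediction: str, pair: QAPair) -> bool:
    """Count a prediction as correct if any ground-truth answer
    appears in the model's output (case-insensitive)."""
    pred = prediction.strip().lower()
    return any(ans.strip().lower() in pred for ans in pair.answers)

pair = QAPair("doc_001.png", "What is the invoice number?", ["INV-2024-001"])
print(is_correct("The invoice number is inv-2024-001.", pair))  # True
```

A lenient containment check like this tolerates free-form LMM output; v2's "more fine-grained evaluation metrics" would replace this single boolean with per-task scoring rules.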

3. Evaluation Evidence from OCRBench and Related Dataset Resources

  • MDPBench evaluation findings: Closed-source models (e.g., Gemini 3-Pro) are relatively robust, while open-source models show an average drop of 14.0% on non-Latin scripts and 17.8% on camera-captured documents, revealing performance imbalances across languages and capture conditions.
  • Related datasets: EST-VQA (Chinese-English bilingual scene text VQA, CVPR 2020), a Swahili dataset (ICDAR 2024), an Urdu dataset (ICDAR 2024), MTVQA (9 languages), and oracle bone script datasets (EVOBC, HUST-OBC), supporting multilingual and low-resource language research.
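The reported drops read naturally as relative accuracy differences. A quick sketch of that arithmetic, where the two scores are made-up numbers chosen only to reproduce the 14.0% figure:

```python
# Illustrative arithmetic for a relative performance drop.
# The scores below are hypothetical, chosen only to reproduce
# the 14.0% non-Latin drop reported for open-source models.
def relative_drop(baseline: float, degraded: float) -> float:
    """Percentage drop of `degraded` relative to `baseline`."""
    return (baseline - degraded) / baseline * 100.0

latin_score = 70.0      # hypothetical accuracy on Latin scripts
non_latin_score = 60.2  # hypothetical accuracy on non-Latin scripts
print(round(relative_drop(latin_score, non_latin_score), 1))  # 14.0
```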

4. Technical Significance and Community Impact of OCRBench

  1. Promotes model improvement: clarifies optimization targets and exposes model weaknesses (e.g., the multilingual shortcomings of open-source models revealed by MDPBench).
  2. Enables fair comparison: standardized benchmarks allow different models to be compared on an equal footing.
  3. Supports industrial applications: helps enterprises assess model suitability (e.g., for multilingual invoice processing).
  4. Reveals research gaps: identifies issues such as the performance gap of open-source models on non-Latin scripts.
  5. Community integration: already integrated into mainstream evaluation frameworks such as lmms-eval and VLMEvalKit.

5. Future Outlook of OCRBench: Directions for Continuous Evolution

  1. Expand task types: Add video text recognition, 3D scene text understanding, etc.
  2. Increase language support: Cover more low-resource languages and endangered writing systems.
  3. Fine-grained evaluation: Develop metrics that distinguish between character/word recognition, semantic understanding, and other levels.
  4. Real-time performance evaluation: Focus on inference speed and resource consumption to support practical deployment.