Zing Forum

A Panoramic Survey of Evaluation Benchmarks for Multimodal Large Language Models: Systematic Review of Over 200 Benchmarks and Future Outlook

This paper systematically reviews over 200 evaluation benchmarks for Multimodal Large Language Models (MLLMs), covering five major dimensions: perceptual understanding, cognitive reasoning, domain-specific applications, key capabilities, and multimodal extensions. It provides a comprehensive research framework and directional guidance for the systematic evaluation of MLLMs.

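Tags: multimodal large language models, MLLM, evaluation benchmarks, benchmark, visual question answering, cross-modal reasoning, hallucination detection, multimodal evaluation, AI evaluation, large model evaluation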
Published 2026-05-26 20:13 | Recent activity 2026-05-26 20:23 | Estimated read 8 min

Section 01

Introduction

Title: A Panoramic Survey of Evaluation Benchmarks for Multimodal Large Language Models: Systematic Review of Over 200 Benchmarks and Future Outlook

Source: Tencent, in collaboration with teams from Peking University, the National University of Singapore, Southeast University, and Nanjing University (original author/maintainer: swordlidev). Published on GitHub: https://github.com/swordlidev/Evaluation-Multimodal-LLMs-Survey

Release date: 2026-05-26

Core viewpoint: This paper systematically reviews over 200 evaluation benchmarks for Multimodal Large Language Models (MLLMs), covering five major dimensions: perceptual understanding, cognitive reasoning, domain-specific applications, key capabilities, and multimodal extensions. It provides a comprehensive research framework and directional guidance for the systematic evaluation of MLLMs.

Section 02

Research Background and Motivation

Multimodal Large Language Models (MLLMs) are an active research area in both academia and industry. They process multiple modalities, such as text and images, and perform well on tasks like visual question answering. Evaluation work, however, is scattered across many separate benchmarks with no systematic integration, so researchers struggle to see which benchmarks exist and how they differ. To address this gap, Tencent and several university teams compiled this survey.

Section 03

Five-Dimensional Classification System of Evaluation Benchmarks

The survey constructs a five-dimensional classification framework:

  1. Perception and Understanding: comprehensive evaluation (e.g., ChEF, UniBench), fine-grained perception (e.g., CODE), image understanding (e.g., Memenos), image quality and aesthetics (e.g., AesBench);
  2. Cognition and Reasoning: general reasoning (e.g., MMRel), chain-of-thought reasoning (e.g., Visual CoT), knowledge reasoning (e.g., KB-VQA), intelligence-test-style reasoning (e.g., RAVEN), multi-disciplinary question answering (e.g., CMMMU);
  3. Domain-Specific Applications: text-rich visual question answering (e.g., TextVQA), document question answering (e.g., SPDocVQA), chart reasoning (e.g., ChartQA), web page understanding (e.g., Web2Code), decision-making agents (e.g., VisualAgentBench), mobile agents (e.g., Mobile-Eval);
  4. Key Capabilities: dialogue ability (e.g., MileBench), hallucination (e.g., POPE), trustworthiness (e.g., MAD-Bench);
  5. Other Modal Extensions: video understanding (e.g., MVBench), audio understanding (e.g., Dynamic-SUPERB), 3D point clouds (e.g., ScanQA), omni-modal evaluation (e.g., MCUB).
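
For readers who want to work with this classification programmatically, the sketch below stores part of it as a nested mapping and looks up where a benchmark sits. It is only an illustration: the names `TAXONOMY`, `locate`, and `benchmarks_in` are hypothetical, the sub-category labels are lightly abbreviated, and only a few entries from the list above are included. It does not reflect how the survey's repository organizes its data.

```python
from typing import Optional

# Dimension -> sub-category -> example benchmarks.
# Entries are a small subset copied from the list above; labels are abbreviated.
TAXONOMY: dict[str, dict[str, list[str]]] = {
    "Perception and Understanding": {
        "comprehensive evaluation": ["ChEF", "UniBench"],
        "image quality and aesthetics": ["AesBench"],
    },
    "Cognition and Reasoning": {
        "chain-of-thought reasoning": ["Visual CoT"],
        "multi-disciplinary question answering": ["CMMMU"],
    },
    "Domain-Specific Applications": {
        "text-rich visual question answering": ["TextVQA"],
        "chart reasoning": ["ChartQA"],
    },
    "Key Capabilities": {
        "hallucination": ["POPE"],
    },
    "Other Modal Extensions": {
        "video understanding": ["MVBench"],
        "3D point clouds": ["ScanQA"],
    },
}


def locate(benchmark: str) -> Optional[tuple[str, str]]:
    """Return (dimension, sub-category) for a benchmark name, or None if absent."""
    wanted = benchmark.casefold()  # case-insensitive match
    for dimension, subcategories in TAXONOMY.items():
        for subcategory, names in subcategories.items():
            if any(name.casefold() == wanted for name in names):
                return dimension, subcategory
    return None


def benchmarks_in(dimension: str) -> list[str]:
    """Return all example benchmarks listed under one dimension, sorted by name."""
    return sorted(
        name for names in TAXONOMY[dimension].values() for name in names
    )


if __name__ == "__main__":
    print(locate("POPE"))
    print(benchmarks_in("Perception and Understanding"))
```

A nested mapping keeps the dimension and sub-category of each benchmark explicit, which makes it easy to pick one benchmark per dimension when assembling an evaluation suite.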

Section 04

Development Trends and Insights of Evaluation Benchmarks

Development Trends:

  • From single capability to comprehensive capability: Early benchmarks each targeted a single task; more recent comprehensive benchmarks (e.g., MME) evaluate many capabilities at once;
  • From static to dynamic: Traditional benchmarks are built on static images; the growing number of video understanding benchmarks reflects demand for temporal reasoning;
  • From general to vertical: Specialized benchmarks for specific domains (e.g., healthcare, autonomous driving) are emerging;
  • From performance to credibility: Hallucination detection, robustness, and related reliability concerns have become active evaluation topics.

Section 05

Current Limitations and Future Directions

Current Limitations:

  • Data leakage: Some benchmark data end up in pre-training corpora, so reported scores overestimate real performance;
  • Incomplete evaluation dimensions: Benchmarks for capabilities such as causal reasoning and common-sense reasoning are lacking;
  • Subjectivity challenges: Open-ended generation tasks are difficult to score objectively and automatically;
  • Cross-modal alignment: More refined frameworks are needed to evaluate how well models fuse information across modalities.

Future Directions: Build dynamically updated benchmarks, develop reliable automatic metrics, strengthen system-level cross-modal evaluation, and establish a framework that evaluates capability and safety jointly.
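
The data leakage limitation above can be made concrete with a simple text-overlap check. The sketch below is one common heuristic, word n-gram overlap between a benchmark item and a training corpus; it is not a method described in the survey. The function names, the 5-gram window, and the 0.5 threshold are all assumptions chosen for illustration, and the check covers only the text side of a multimodal benchmark (leaked images would need a separate comparison).

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_grams: set[tuple[str, ...]], n: int) -> float:
    """Fraction of the item's n-grams that also occur in the corpus."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & corpus_grams) / len(item_grams)


def flag_leaked(
    items: dict[str, str],
    corpus: list[str],
    n: int = 5,            # assumed window size
    threshold: float = 0.5,  # assumed cut-off, not a standard value
) -> dict[str, float]:
    """Return {item_id: ratio} for items whose overlap ratio meets the threshold."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus:
        corpus_grams |= ngrams(doc, n)
    flagged = {}
    for item_id, text in items.items():
        ratio = overlap_ratio(text, corpus_grams, n)
        if ratio >= threshold:
            flagged[item_id] = ratio
    return flagged


if __name__ == "__main__":
    # Toy data, invented for this example only.
    items = {
        "q1": "what color is the umbrella held by the woman",
        "q2": "how many dogs are sitting on the red sofa",
    }
    corpus = ["caption dump: what color is the umbrella held by the woman in the photo"]
    print(flag_leaked(items, corpus))
```

Exact n-gram matching misses paraphrased or translated leaks, which is one reason dynamically updated benchmarks are listed among the future directions.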

Section 06

Practical Value and Community Contributions

This survey also serves as a practical resource collection. The GitHub repository is continuously maintained and gathers the paper, code, and dataset links for all covered benchmarks. It gives researchers a single place to find suitable evaluation tools and reduces the time spent searching for them.

Section 07

Conclusion

Progress in multimodal large language models depends on rigorous and comprehensive evaluation, and this survey offers a systematic view of the field. As model capabilities improve, evaluation benchmarks must keep evolving. We call on the community to invest in evaluation innovation, so that better evaluation guides model development toward more reliable and practical multimodal AI.