A Panoramic Survey of Evaluation Benchmarks for Multimodal Large Language Models: Systematic Review of Over 200 Benchmarks and Future Outlook

This paper systematically reviews over 200 evaluation benchmarks for Multimodal Large Language Models (MLLMs), covering five major dimensions: perceptual understanding, cognitive reasoning, domain-specific applications, key capabilities, and multimodal extensions. It provides a comprehensive research framework and directional guidance for the systematic evaluation of MLLMs.

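Tags: multimodal large language models, MLLM, evaluation benchmarks, benchmark, visual question answering, cross-modal reasoning, hallucination detection, multimodal evaluation, AI evaluation, large model evaluation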

Published 2026-05-26 20:13 | Recent activity 2026-05-26 20:23 | Estimated read 8 min

Section 01

Introduction

Source: Tencent in collaboration with teams from Peking University, National University of Singapore, Southeast University, and Nanjing University (original author/maintainer: swordlidev).
Published on GitHub: https://github.com/swordlidev/Evaluation-Multimodal-LLMs-Survey
Release date: 2026-05-26

Section 02

Research Background and Motivation

Multimodal Large Language Models (MLLMs) are an active research area in both academia and industry. They process inputs that span modalities such as text and images, and perform well on tasks like visual question answering. Evaluation work, however, is scattered across many separate benchmarks with little systematic integration, which makes it hard for researchers to see which benchmarks exist and how they differ. To address this gap, Tencent collaborated with several universities to produce this survey.

Section 03

Five-Dimensional Classification System of Evaluation Benchmarks

The survey constructs a five-dimensional classification framework:

Perception and Understanding: Comprehensive evaluation (e.g., ChEF, UniBench), fine-grained perception (e.g., CODE), image understanding (e.g., Memenos), image quality and aesthetics (e.g., AesBench);
Cognition and Reasoning: General reasoning (e.g., MMRel), chain-of-thought reasoning (e.g., Visual CoT), knowledge reasoning (e.g., KB-VQA), intelligent question answering (e.g., RAVEN), multi-disciplinary question answering (e.g., CMMMU);
Domain-Specific Applications: Text-rich visual question answering (e.g., TextVQA), document question answering (e.g., SPDocVQA), chart reasoning (e.g., ChartQA), web page understanding (e.g., Web2Code), decision-making agents (e.g., VisualAgentBench), mobile agents (e.g., Mobile-Eval);
Key Capabilities: Dialogue ability (e.g., Mile-Bench), hallucination issues (e.g., POPE), credibility (e.g., MAD-Bench);
Other Modal Extensions: Video understanding (e.g., MVBench), audio understanding (e.g., Dynamic-SUPERB), 3D point clouds (e.g., ScanQA), full modalities (e.g., MCUB).
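The classification above is a two-level mapping: dimension, then sub-category, then example benchmarks. The following is a minimal sketch of that mapping as a Python dictionary, transcribed from the list above. The names `TAXONOMY`, `locate`, and `benchmarks_in` are hypothetical helpers for illustration and are not part of the survey or its repository.

```python
from typing import Dict, List, Optional, Tuple

# Dimension -> sub-category -> example benchmarks, copied from the list above.
# Only the examples named in this article are included, not the full survey.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Perception and Understanding": {
        "Comprehensive evaluation": ["ChEF", "UniBench"],
        "Fine-grained perception": ["CODE"],
        "Image understanding": ["Memenos"],
        "Image quality and aesthetics": ["AesBench"],
    },
    "Cognition and Reasoning": {
        "General reasoning": ["MMRel"],
        "Chain-of-thought reasoning": ["Visual CoT"],
        "Knowledge reasoning": ["KB-VQA"],
        "Intelligent question answering": ["RAVEN"],
        "Multi-disciplinary question answering": ["CMMMU"],
    },
    "Domain-Specific Applications": {
        "Text-rich visual question answering": ["TextVQA"],
        "Document question answering": ["SPDocVQA"],
        "Chart reasoning": ["ChartQA"],
        "Web page understanding": ["Web2Code"],
        "Decision-making agents": ["VisualAgentBench"],
        "Mobile agents": ["Mobile-Eval"],
    },
    "Key Capabilities": {
        "Dialogue ability": ["Mile-Bench"],
        "Hallucination issues": ["POPE"],
        "Credibility": ["MAD-Bench"],
    },
    "Other Modal Extensions": {
        "Video understanding": ["MVBench"],
        "Audio understanding": ["Dynamic-SUPERB"],
        "3D point clouds": ["ScanQA"],
        "Full modalities": ["MCUB"],
    },
}


def locate(benchmark: str) -> Optional[Tuple[str, str]]:
    """Return (dimension, sub-category) for a benchmark name, or None if absent."""
    wanted = benchmark.casefold()
    for dimension, subcategories in TAXONOMY.items():
        for subcategory, names in subcategories.items():
            if any(name.casefold() == wanted for name in names):
                return dimension, subcategory
    return None


def benchmarks_in(dimension: str) -> List[str]:
    """List every example benchmark filed under one dimension."""
    return [name for names in TAXONOMY[dimension].values() for name in names]


if __name__ == "__main__":
    print(locate("POPE"))
    print(benchmarks_in("Other Modal Extensions"))
```

A nested dictionary keeps the lookup code short. A flat table with one row per benchmark would likely suit the full set of 200+ entries better.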

Section 04

Development Trends and Insights of Evaluation Benchmarks

Development Trends:

From single capability to comprehensive capability: Early benchmarks focused on single tasks; more recent comprehensive benchmarks (e.g., MME) evaluate many capabilities together;
From static to dynamic: Traditional benchmarks are built on static images; the growing number of video understanding benchmarks reflects the demand for temporal reasoning;
From general to vertical: Specialized benchmarks for specific domains (e.g., healthcare, autonomous driving) are emerging;
From performance to credibility: Hallucination detection and robustness have become active evaluation targets alongside raw task performance.
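To make the shift toward credibility concrete, here is a hedged sketch of how a yes/no hallucination probe can be scored. Benchmarks in the style of POPE ask whether an object is present in an image and compare the model's yes/no answers with ground truth. The function name `binary_probe_metrics` and the sample answers are assumptions for illustration. The metrics are the standard binary-classification ones plus the share of "yes" answers, not code taken from any benchmark's repository.

```python
from typing import Dict, List


def binary_probe_metrics(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Score yes/no answers against ground truth, treating "yes" as the positive class."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")

    pairs = list(zip(gold, pred))
    tp = sum(g == "yes" and p == "yes" for g, p in pairs)  # object present, model says yes
    fp = sum(g == "no" and p == "yes" for g, p in pairs)   # object absent, model says yes
    fn = sum(g == "yes" and p == "no" for g, p in pairs)   # object present, model says no
    tn = sum(g == "no" and p == "no" for g, p in pairs)    # object absent, model says no
    total = len(pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes ratio well above the true share of "yes" items suggests the model
        # tends to affirm objects that are not in the image.
        "yes_ratio": (tp + fp) / total,
    }


if __name__ == "__main__":
    gold = ["yes", "yes", "no", "no"]   # hypothetical ground truth
    pred = ["yes", "yes", "yes", "no"]  # hypothetical model answers
    print(binary_probe_metrics(gold, pred))
```

In this toy input the model affirms one absent object, so precision drops while recall stays at 1.0. That asymmetry is the pattern a hallucination probe is meant to surface.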

Section 05

Current Limitations and Future Directions

Current Limitations:

Data leakage: Some benchmark data end up in pre-training corpora, which inflates measured performance;
Incomplete evaluation dimensions: Benchmarks for causal reasoning, common-sense reasoning, and similar abilities are still lacking;
Subjectivity challenges: Open-ended generation tasks are difficult to evaluate objectively and automatically;
Cross-modal alignment: More refined frameworks are needed to evaluate how well models fuse information across modalities.
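The data leakage concern can be illustrated with a simple text-overlap check. The sketch below measures what fraction of a benchmark item's word n-grams also appear in a sample of training text. It is an assumption-laden illustration rather than a method described in the survey: `ngrams` and `overlap_ratio` are hypothetical names, whitespace tokenization is a simplification, and the check covers only the textual side of a benchmark item, not its images.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (empty if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in corpus_docs."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical benchmark question and hypothetical training snippets.
    question = "what color is the cat sitting on the red sofa"
    leaked = ["Q: what color is the cat sitting on the red sofa A: red"]
    clean = ["a survey of evaluation benchmarks for multimodal models"]
    print(overlap_ratio(question, leaked, n=3))
    print(overlap_ratio(question, clean, n=3))
```

A high ratio flags an item for manual review; it does not prove contamination, and a low ratio does not rule out paraphrased leakage.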

Future Directions: Build dynamically updated benchmarks, develop reliable automatic metrics, strengthen cross-modal system evaluation, and establish a joint evaluation framework for capability and safety.

Section 06

Practical Value and Community Contributions

The survey also serves as a practical resource collection. The GitHub repository is continuously maintained and gathers the paper, code, and dataset links for each benchmark, giving researchers a single place to locate suitable evaluation tools and cutting the time spent searching for them.

Section 07

Conclusion

The development of multimodal large language models relies on a scientific and comprehensive evaluation system, and this survey provides a systematic perspective. As model capabilities improve, evaluation benchmarks need to evolve continuously. We call on the community to pay attention to evaluation innovation, drive the healthy development of models through evaluation, and promote more reliable and practical multimodal AI.
