
SymbolBench: A Comprehensive Evaluation Benchmark for Visual Symbol Understanding Capabilities of Multimodal Large Language Models

SymbolBench, developed by the Knowledge Engineering Laboratory of Tsinghua University, is a comprehensive benchmark specifically designed to evaluate the discrete visual symbol recognition, parsing, and reasoning capabilities of multimodal large language models (MLLMs), filling the gap in the current evaluation system for structured visual understanding.

Tags: Multimodal Large Language Models · Visual Symbol Understanding · Benchmark · Symbol Reasoning · MLLM Evaluation · Tsinghua University
Published 2026-04-08 15:43 · Recent activity 2026-04-08 15:49 · Estimated read: 6 min

Section 01

[Introduction] SymbolBench: A Professional Evaluation Benchmark for Visual Symbol Understanding of Multimodal Large Language Models

SymbolBench, launched by the Knowledge Engineering Laboratory of Tsinghua University, is a comprehensive benchmark designed to evaluate the discrete visual symbol recognition, parsing, and reasoning capabilities of multimodal large language models (MLLMs), filling a gap in the current evaluation landscape for structured visual understanding. The benchmark follows three design principles, comprehensiveness, hierarchy, and practicality, and covers multiple symbol types across multi-dimensional tasks. Its results reveal a clear stratification of capabilities among mainstream models in symbol understanding and point to concrete improvement directions for the research community.


Section 02

Background and Motivation: The Lack of Evaluation for Discrete Visual Symbols

With the rapid development of MLLMs such as GPT-4V and Gemini, existing evaluations focus mostly on natural image understanding (e.g., object recognition, scene description), while coverage of discrete visual symbols (mathematical formulas, flowcharts, circuit diagrams, etc.) remains weak. These symbols are highly structured and abstract, requiring models to understand spatial relationships, logical hierarchies, and semantic associations between elements. SymbolBench was created precisely to fill this gap.


Section 03

Core Design Philosophy and Evaluation Task Dimensions

SymbolBench is designed following three core principles:

  1. Comprehensiveness: covers multiple symbol types such as mathematical expressions, logical diagrams, and engineering drawings;
  2. Hierarchy: progresses from basic symbol recognition, to parsing into structured representations, to symbolic reasoning and computation;
  3. Practicality: tasks stay close to real-world scenarios (e.g., formula calculation, flowchart logic understanding).

The evaluation tasks span four dimensions: symbol recognition and localization, parsing and structuring, reasoning and computation, and cross-symbol-type transfer.
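The four task dimensions can be illustrated with a hypothetical benchmark record. This is a minimal sketch for intuition only; the field names and values below are assumptions of this article, not the released SymbolBench schema:

```python
from dataclasses import dataclass

@dataclass
class SymbolBenchItem:
    """Hypothetical schema for one benchmark example (illustrative only)."""
    image_path: str    # rendered, hand-drawn, or scanned symbol image
    symbol_type: str   # e.g., "math", "flowchart", "circuit"
    task: str          # one of the four evaluation dimensions
    ground_truth: str  # expected answer: label, LaTeX/JSON structure, or value

# One hypothetical example per evaluation dimension.
items = [
    SymbolBenchItem("eq1.png",   "math",      "recognition", "\\int"),
    SymbolBenchItem("eq1.png",   "math",      "parsing",     "\\int_0^1 x^2\\,dx"),
    SymbolBenchItem("eq1.png",   "math",      "reasoning",   "1/3"),
    SymbolBenchItem("flow1.png", "flowchart", "transfer",    "loop until n == 0"),
]

tasks = {it.task for it in items}
print(sorted(tasks))  # → ['parsing', 'reasoning', 'recognition', 'transfer']
```

A record structured this way lets one image serve several dimensions at once, which matches the benchmark's hierarchical design: the same formula image is first recognized, then parsed, then used for a computation question.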

Section 04

Technical Implementation and Dataset Construction

The dataset construction combines real data (academic papers, textbooks) and synthetic data, covering multiple visual styles (hand-drawn, software-generated, scanned). Annotations include symbol bounding boxes, structured results (LaTeX, JSON), and task answers. Evaluation metrics are differentiated: precision/recall/F1 for recognition, tree edit distance for parsing, and accuracy for reasoning.
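The recognition-stage metrics can be sketched as a small set-based computation. This is a simplified illustration, assuming each prediction is a (label, bounding-box) pair scored only on exact match; a production evaluator would match boxes by IoU rather than exact coordinates:

```python
def recognition_prf1(predicted, gold):
    """Precision/recall/F1 for symbol recognition.

    Simplified sketch: a predicted (label, box) pair counts as a true
    positive only if it exactly matches a gold pair. Real detection
    evaluation would use IoU-based box matching instead.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                            # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the model finds two of three gold symbols plus one spurious box.
gold = {("\\sum", (10, 10, 40, 40)), ("x", (50, 12, 60, 30)), ("=", (65, 15, 75, 25))}
pred = {("\\sum", (10, 10, 40, 40)), ("x", (50, 12, 60, 30)), ("+", (80, 15, 90, 25))}
p, r, f = recognition_prf1(pred, gold)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.667 0.667
```

Parsing is scored differently because a parse is a tree, not a set: tree edit distance counts the minimum node insertions, deletions, and relabelings needed to turn the predicted structure into the gold one, so a single swapped nesting level costs little while a flattened hierarchy costs a lot.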


Section 05

Current Model Performance: Capability Stratification and Shortcomings

Preliminary evaluations show that mainstream models have obvious capability stratification:

  • High accuracy on basic recognition tasks;
  • Frequent structural errors in parsing tasks (e.g., confusing nested hierarchies);
  • Hallucinations in reasoning tasks (conclusions inconsistent with the symbols' meanings);
  • Large performance differences across symbol types, reflecting the low proportion of symbol data in training corpora.

Section 06

Implications and Recommendations for the Research Community

  1. Symbol understanding requires dedicated modules: it should not be treated as a mere subset of general vision; enhanced attention mechanisms or symbol-aware pre-training can be introduced;
  2. Increase high-quality symbol data: Improve the proportion of symbols in training data, especially data with parsing annotations and reasoning chains;
  3. Emphasize domain-specific benchmarks: SymbolBench provides a clear evaluation framework for research and guides future directions.

Section 07

Conclusion: The Significance and Future of SymbolBench

As the first evaluation benchmark dedicated to discrete visual symbols, SymbolBench reveals the capability boundaries of current MLLMs and points out directions for model improvement. As multimodal AI penetrates deeper into real applications, structured visual understanding will become an important measure of practical value, and SymbolBench is a key step in that direction.