Zing Forum

ChartJudge-2B: An Open-Source Small Vision-Language Model Judge for Chart Understanding Evaluation

An open-source project featured in two papers (ACL 2025 and EMNLP 2025), proposing the LVLM-as-a-Judge evaluation framework and releasing the 2B-parameter ChartJudge model, which delivers chart understanding evaluation capabilities comparable to GPT-4o despite its compact size.

Tags: Vision-Language Models · Chart Understanding · LVLM Evaluation · ACL 2025 · EMNLP 2025 · Open-Source Models · Multimodal AI · ChartJudge
Published 2026-04-20 05:41 · Recent activity 2026-04-20 05:50 · Estimated read: 7 min

Section 01

Introduction: ChartJudge-2B—A Breakthrough in Compact Open-Source Chart Evaluation Models

This open-source project, featured in papers at ACL 2025 and EMNLP 2025, proposes the LVLM-as-a-Judge evaluation framework and releases the 2B-parameter ChartJudge-2B model, which achieves chart understanding evaluation quality close to GPT-4o at a fraction of the size, balancing cost-effectiveness and evaluation quality.


Section 02

Research Background: Pain Points in Chart Understanding Evaluation

Chart understanding is a key challenge for Large Vision-Language Models (LVLMs), requiring accurate data extraction and trend comprehension. Existing evaluations rely on manual annotation or closed-source large models (e.g., GPT-4), both of which are costly. This project explores using open-source LVLMs as 'judges' for chart understanding tasks, constructing an evaluation framework and releasing the ChartJudge-2B model.


Section 03

Core Methodology: Detailed Explanation of the LVLM-as-a-Judge Evaluation Framework

Multi-Dimensional Evaluation Modes

  • Pairwise Evaluation: select the better of two candidate answers
  • Single-Point Scoring: rate a single answer on a 1-5 Likert scale
  • With/Without Reference Evaluation: a reference (standard) answer can optionally be provided
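The modes above can be sketched as a single prompt builder. This is a minimal illustration; the function name and prompt wording are assumptions, not the project's actual templates.

```python
# Illustrative judge-prompt builder covering pairwise vs. single-point (Likert)
# modes, with an optional reference answer. The chart image would be passed to
# the LVLM separately; only the text prompt is built here.

def build_judge_prompt(question, answer_a, answer_b=None, reference=None):
    """Build an LVLM-judge text prompt for one chart question."""
    lines = [f"Question about the chart: {question}"]
    if reference is not None:  # with-reference evaluation
        lines.append(f"Reference answer: {reference}")
    if answer_b is not None:  # pairwise mode: pick the better candidate
        lines += [
            f"Answer A: {answer_a}",
            f"Answer B: {answer_b}",
            "Which answer is better? Reply with 'A' or 'B'.",
        ]
    else:  # single-point mode: 1-5 Likert scale
        lines += [
            f"Answer: {answer_a}",
            "Rate the answer on a 1-5 Likert scale. Reply with a single digit.",
        ]
    return "\n".join(lines)


pairwise = build_judge_prompt("Which year peaked?", "2019", "2020", reference="2019")
pointwise = build_judge_prompt("Which year peaked?", "2019")
```

The same builder covers all three mode switches (pairwise vs. single-point, with vs. without reference) through its optional arguments.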

Multi-Criteria Evaluation Dimensions

  • Factual Correctness: Data consistency with the chart
  • Information Richness: Sufficiency of the answer's information
  • Relevance: Alignment with the question
  • Multi-Dimensional Comprehensive Quality: an overall assessment combining all of the above
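A multi-criteria judge response can be parsed into per-criterion scores. Below is a minimal sketch assuming a "Criterion: score" line format; both that format and the shortened "Overall Quality" label are assumptions for illustration, not the project's actual schema.

```python
# Parse a multi-criteria judge response into per-criterion 1-5 scores.
import re

# Criterion names roughly following the four dimensions above (labels assumed).
CRITERIA = [
    "Factual Correctness",
    "Information Richness",
    "Relevance",
    "Overall Quality",
]

def parse_multi_criteria(response: str) -> dict:
    """Extract 'Criterion: <1-5>' lines from a judge response."""
    scores = {}
    for name in CRITERIA:
        m = re.search(rf"{re.escape(name)}\s*:\s*([1-5])", response)
        if m:
            scores[name] = int(m.group(1))
    return scores


sample = (
    "Factual Correctness: 4\n"
    "Information Richness: 3\n"
    "Relevance: 5\n"
    "Overall Quality: 4"
)
scores = parse_multi_criteria(sample)
```

Missing or malformed criteria simply drop out of the result, which makes it easy to detect when a judge fails to follow the multi-criteria format.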

Large-Scale Benchmark Testing

Over 100,000 judgment annotations were conducted on the OpenCQA and VisText datasets. Using GPT-4o and LLaVA-Critic-70B as references, 13 open-source LVLMs (2B-9B parameters) were evaluated.
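Consistency with a reference judge on pairwise verdicts reduces to simple agreement: the fraction of items where both judges pick the same winner. A sketch of that metric, assuming verdicts are recorded as 'A'/'B' choices (an assumed data format):

```python
# Agreement rate between an open-source judge and a reference judge
# (e.g., GPT-4o or LLaVA-Critic-70B) on pairwise verdicts.

def agreement_rate(judge_choices, reference_choices):
    """Fraction of pairwise items where both judges pick the same answer."""
    if len(judge_choices) != len(reference_choices):
        raise ValueError("choice lists must be the same length")
    matches = sum(a == b for a, b in zip(judge_choices, reference_choices))
    return matches / len(judge_choices)


# 4 agreements out of 5 items -> 0.8
rate = agreement_rate(["A", "B", "A", "A", "B"], ["A", "B", "B", "A", "B"])
```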


Section 04

ChartJudge-2B: A Compact Judge Model with Strong Capabilities

Performance

Model                        OpenCQA (Pairwise ↑)   VisText L1 (Pairwise ↑)   VisText L2/L3 (Pairwise ↑)
Qwen2-VL-2B (Base Version)   54.0%                  27.2%                     3.0%
ChartJudge-2B                61.7%                  64.6%                     52.3%
LLaVA-Critic-7B              79.5%                  79.1%                     77.1%
ChartJudge-2B improves dramatically over its base model on every split and, despite its size, substantially narrows the gap with LLaVA-Critic-7B.

Robustness to Multi-Criteria Prompts

Under multi-criteria prompts, the accuracy of 7B models (e.g., LLaVA-Critic) plummets to nearly 0%, while ChartJudge-2B maintains an accuracy of 46.86%.

Deployment Advantages

  • Speed: roughly 2x faster inference than 7B judge models
  • Cost: roughly half the operational cost of 7B judges
  • Hardware: runs on GPUs with 8 GB of VRAM (e.g., T4)
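The VRAM claim is easy to sanity-check with back-of-the-envelope arithmetic: in fp16, model weights cost about 2 bytes per parameter, so a 2B-parameter judge needs under 4 GB for weights, leaving headroom for activations and the KV cache.

```python
# Rough weight-memory estimate for a 2B-parameter model served in fp16.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1024**3 bytes); fp16 = 2 bytes."""
    return n_params * bytes_per_param / 1024**3


fp16_gb = weight_memory_gb(2e9)  # ~3.7 GB: weights fit well within 8 GB of VRAM
```

This is only the weight footprint; actual serving also needs memory for activations and the KV cache, but the remaining budget on an 8 GB card is ample for a 2B model.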

Section 05

Key Findings: Evaluation Potential and Limitations of Open-Source Models

  • Potential of open-source models: Some 7B open-source LVLMs have chart evaluation capabilities close to GPT-4o (about 80% consistency), making them suitable for privacy-sensitive scenarios.
  • Limitations of specialized models: Chart-specific models like ChartGemma and PaliGemma have 0% accuracy when used as judges, indicating that specialized understanding ability ≠ general evaluation ability.
  • Double-edged sword of multi-criteria prompts: they provide richer evaluation dimensions but expose model vulnerabilities; 7B models collapse to near-zero accuracy under them.
  • Cross-model generalization: ChartJudge-2B was trained using Gemini-1.5-Pro as a reference, but remains stable when evaluated with GPT-4o/LLaVA-Critic-70B.
  • Correlation with human judgment: LLaVA-Critic-70B tracks human judgment more closely (mean error distance 0.81, lower is better) than GPT-4o (0.93).
  • Prevalent biases: All judge models exhibit position bias and length bias.
  • Power of fine-tuning: After fine-tuning, PaliGemma-3B's VisText pairwise accuracy increased from 0% to 55.9%.
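The position bias noted above can be probed by judging each pair twice with the candidate order swapped and measuring how often the verdict stays consistent. The judges below are toy stand-ins for an actual LVLM call, used only to illustrate the probe.

```python
# Probe for position bias: a consistent judge should pick the same underlying
# answer regardless of whether it appears in slot A or slot B.

def position_consistency(judge, pairs):
    """Fraction of pairs judged consistently when candidate order is swapped."""
    consistent = 0
    for a, b in pairs:
        first = judge(a, b)      # 'A' means the first candidate wins
        swapped = judge(b, a)
        # Consistent iff the same underlying answer wins in both orderings.
        if (first == "A") == (swapped == "B"):
            consistent += 1
    return consistent / len(pairs)


def always_first(a, b):
    """Maximally position-biased judge: always prefers the first candidate."""
    return "A"

def prefers_longer(a, b):
    """Content-based judge: prefers the longer answer regardless of position."""
    return "A" if len(a) >= len(b) else "B"


pairs = [("short", "a much longer answer"), ("x", "yy")]
biased_rate = position_consistency(always_first, pairs)   # 0.0: verdict flips meaning on swap
fair_rate = position_consistency(prefers_longer, pairs)   # 1.0: same answer wins either way
```

A similar swap test with padded answers of equal content would expose length bias.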

Section 06

Application Value: Reducing Costs and Promoting Evaluation Standardization

  • Cost reduction: Replaces GPT-4o, providing an economical solution for large-scale evaluations.
  • Privacy scenarios: Local deployment of open-source models is suitable for enterprises that cannot use external APIs.
  • Evaluation standardization: Proposes pairwise/single-point scoring, multi-dimensional evaluation paradigms, and metrics, providing references for domain standardization.
  • Revealing capability boundaries: By comparing 13 open-source LVLMs, it reveals their vulnerabilities under multi-criteria prompts and points out directions for improvement.

Section 07

Open-Source Resources: Full Access to Code, Models, and Data

The project's open-source content includes:

  • Complete implementation of the evaluation framework
  • ChartJudge-2B model weights
  • Training dataset (~9.7K single-criteria + ~2.8K multi-criteria)
  • Evaluation scripts and benchmark testing code
  • Experiment configurations and hyperparameters

Chart image data can be downloaded via the project's Google Drive link.