# ChartJudge-2B: An Open-Source Small Vision-Language Model Judge for Chart Understanding Evaluation

> An open-source project featured in two papers (ACL 2025 and EMNLP 2025), proposing the LVLM-as-a-Judge evaluation framework and releasing the 2B-parameter ChartJudge model, which delivers chart understanding evaluation capabilities comparable to GPT-4o despite its compact size.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-04-19T21:41:57.000Z
- Last activity: 2026-04-19T21:50:19.213Z
- Heat: 152.9
- Keywords: Vision-Language Models, Chart Understanding, LVLM, Model Evaluation, ACL 2025, EMNLP 2025, Open-Source Models, Multimodal AI, ChartJudge
- Page link: https://www.zingnex.cn/en/forum/thread/chartjudge-2b
- Canonical: https://www.zingnex.cn/forum/thread/chartjudge-2b
- Markdown source: floors_fallback

---

## Introduction: ChartJudge-2B—A Breakthrough in Compact Open-Source Chart Evaluation Models

ChartJudge-2B is an open-source project described in two papers (ACL 2025 and EMNLP 2025). The work proposes an LVLM-as-a-Judge evaluation framework for chart understanding and releases ChartJudge-2B, a 2B-parameter judge model that approaches GPT-4o's evaluation quality at a fraction of the size and cost.

## Research Background: Pain Points in Chart Understanding Evaluation

Chart understanding is a key challenge for Large Vision-Language Models (LVLMs): a model must extract data accurately and comprehend trends. Existing evaluations rely on manual annotation or closed-source large models (e.g., GPT-4), both of which are costly. This project explores using open-source LVLMs as "judges" for chart understanding tasks, constructing an evaluation framework and releasing the ChartJudge-2B model.

## Core Methodology: Detailed Explanation of the LVLM-as-a-Judge Evaluation Framework

### Multi-Dimensional Evaluation Modes
- Pairwise evaluation: pick the better of two candidate answers
- Pointwise scoring: rate a single answer on a 1-5 Likert scale
- With/without reference: optionally provide a gold answer to the judge

### Multi-Criteria Evaluation Dimensions
- Factual correctness: consistency of the answer's data with the chart
- Informativeness: sufficiency of the information in the answer
- Relevance: alignment with the question
- Overall quality: a holistic judgment across dimensions
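A multi-criteria prompt asks the judge to score each dimension separately. The criterion names follow the list above, but the "one `criterion: score` per line" reply format and both helper functions are assumptions for this sketch:

```python
import re

# Criteria from the framework; the line-per-criterion reply format is an
# assumption made for this illustration.
CRITERIA = ("factual correctness", "informativeness", "relevance", "overall quality")


def build_multicriteria_prompt(question: str, answer: str) -> str:
    """Assemble a pointwise multi-criteria prompt for one chart QA item."""
    lines = [
        "You are a judge for chart question answering.",
        f"Question: {question}",
        f"Answer: {answer}",
        "Rate the answer on a 1-5 Likert scale for each criterion,",
        "one per line, formatted as '<criterion>: <score>':",
    ]
    lines += [f"- {c}" for c in CRITERIA]
    return "\n".join(lines)


def parse_scores(reply: str) -> dict[str, int]:
    """Parse '<criterion>: <score>' lines into a criterion -> score dict."""
    scores = {}
    for crit in CRITERIA:
        m = re.search(rf"{re.escape(crit)}\s*:\s*([1-5])", reply, re.IGNORECASE)
        if m:
            scores[crit] = int(m.group(1))
    return scores
```

Returning a partial dict (rather than raising) when a criterion is missing makes it easy to measure how often a judge breaks format, which is exactly where small models struggle under multi-criteria prompts.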

### Large-Scale Benchmark Testing
More than 100,000 judgments were collected on the OpenCQA and VisText datasets. With GPT-4o and LLaVA-Critic-70B as reference judges, 13 open-source LVLMs (2B-9B parameters) were evaluated.

## ChartJudge-2B: A Compact Judge Model with Strong Capabilities

### Performance
| Model | OpenCQA (Pairwise ↑) | VisText L1 (Pairwise ↑) | VisText L2/L3 (Pairwise ↑) |
|------|------------------|---------------------|------------------------|
| Qwen2-VL-2B (Base Version) | 54.0% | 27.2% | 3.0% |
| **ChartJudge-2B** | **61.7%** | **64.6%** | **52.3%** |
| LLaVA-Critic-7B | 79.5% | 79.1% | 77.1% |

ChartJudge-2B improves dramatically over its base model on every split and, at less than a third of the size, closes much of the gap to the 7B judge.
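The pairwise accuracies in the table measure agreement with a reference judge (GPT-4o or LLaVA-Critic-70B): the fraction of items where both judges pick the same winner. A minimal sketch of that metric:

```python
def pairwise_agreement(judge_verdicts: list[str],
                       reference_verdicts: list[str]) -> float:
    """Fraction of items where the candidate judge picks the same winner
    ('A' or 'B') as the reference judge."""
    if len(judge_verdicts) != len(reference_verdicts):
        raise ValueError("verdict lists must be aligned item-by-item")
    hits = sum(j == r for j, r in zip(judge_verdicts, reference_verdicts))
    return hits / len(reference_verdicts)
```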

### Robustness to Multi-Criteria Prompts
Under multi-criteria prompts, the accuracy of 7B models (e.g., LLaVA-Critic) plummets to nearly 0%, while ChartJudge-2B maintains an accuracy of 46.86%.

### Deployment Advantages
- Speed: 2x faster than 7B judge models
- Cost: 2x lower operational cost
- Hardware: Can run on GPUs with 8GB VRAM (e.g., T4)
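The 8 GB VRAM claim is plausible from back-of-envelope weight arithmetic (fp16/bf16 weights only, ignoring activations and KV cache; parameter counts here are nominal round numbers):

```python
def fp16_weight_gib(n_params: float) -> float:
    """Approximate weight memory in GiB at 2 bytes per parameter (fp16/bf16)."""
    return n_params * 2 / 2**30

# A nominal 2B-parameter judge needs ~3.7 GiB for weights, leaving headroom
# for activations and KV cache on an 8 GiB GPU; a nominal 7B judge already
# needs ~13 GiB in fp16 and must be quantized or sharded to fit.
```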

## Key Findings: Evaluation Potential and Limitations of Open-Source Models

- Potential of open-source models: Some 7B open-source LVLMs have chart evaluation capabilities close to GPT-4o (about 80% consistency), making them suitable for privacy-sensitive scenarios.
- Limitations of specialized models: Chart-specific models like ChartGemma and PaliGemma have 0% accuracy when used as judges, indicating that specialized understanding ability ≠ general evaluation ability.
- Double-edged sword of multi-criteria prompts: They provide rich dimensions but expose model vulnerabilities—7B models almost fail.
- Cross-model generalization: ChartJudge-2B was trained using Gemini-1.5-Pro as a reference, but remains stable when evaluated with GPT-4o/LLaVA-Critic-70B.
- Correlation with human judgment: LLaVA-Critic-70B tracks human ratings more closely than GPT-4o (mean error distance 0.81 vs. 0.93; lower is better).
- Prevalent biases: All judge models exhibit position bias and length bias.
- Power of fine-tuning: After fine-tuning, PaliGemma-3B's VisText pairwise accuracy increased from 0% to 55.9%.
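Position bias, mentioned in the findings above, can be probed by running each pair twice with the candidate order swapped and checking whether the judge still picks the same underlying answer. The function below is an illustrative probe, not the paper's measurement protocol:

```python
def position_consistency(judge, items) -> float:
    """Fraction of items where a pairwise judge picks the same underlying
    answer regardless of candidate order; a position-biased judge scores low.

    `judge(question, first, second)` must return 'A' (first slot) or
    'B' (second slot).
    """
    consistent = 0
    for question, ans_x, ans_y in items:
        v1 = judge(question, ans_x, ans_y)
        v2 = judge(question, ans_y, ans_x)  # same pair, slots swapped
        winner1 = ans_x if v1 == "A" else ans_y
        winner2 = ans_y if v2 == "A" else ans_x
        consistent += winner1 == winner2
    return consistent / len(items)
```

Note this probe detects only order sensitivity; length bias (favoring longer answers) is order-invariant and needs a separate check, e.g. correlating verdicts with answer length.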

## Application Value: Reducing Costs and Promoting Evaluation Standardization

- Cost reduction: Replaces GPT-4o, providing an economical solution for large-scale evaluations.
- Privacy scenarios: Local deployment of open-source models is suitable for enterprises that cannot use external APIs.
- Evaluation standardization: proposes pairwise and pointwise scoring protocols, multi-dimensional evaluation criteria, and metrics, offering a reference point for standardizing evaluation in this domain.
- Revealing capability boundaries: by comparing 13 open-source LVLMs, it exposes their vulnerability to multi-criteria prompts and points out directions for improvement.

## Open-Source Resources: Full Access to Code, Models, and Data

The project's open-source content includes:
- Complete implementation of the evaluation framework
- ChartJudge-2B model weights
- Training dataset (~9.7K single-criteria + ~2.8K multi-criteria)
- Evaluation scripts and benchmark testing code
- Experiment configurations and hyperparameters

Chart image data can be downloaded via the project's Google Drive link.
