Zing Forum

ChartJudge-2B: An Open-Source Small Vision-Language Model Judge for Chart Understanding Evaluation

An open-source project featured in two papers (ACL 2025 and EMNLP 2025), proposing the LVLM-as-a-Judge evaluation framework and releasing the 2B-parameter ChartJudge model, which delivers chart understanding evaluation capabilities comparable to GPT-4o despite its compact size.

Tags: Vision-Language Models · Chart Understanding · LVLM Evaluation · ACL 2025 · EMNLP 2025 · Open-Source Models · Multimodal AI · ChartJudge
Published 2026-04-20 05:41 · Recent activity 2026-04-20 05:50 · Estimated read: 7 min

Section 01

Introduction: ChartJudge-2B—A Breakthrough in Compact Open-Source Chart Evaluation Models

This open-source project, featured in papers at ACL 2025 and EMNLP 2025, proposes the LVLM-as-a-Judge evaluation framework and releases the 2B-parameter ChartJudge-2B model, which achieves chart understanding evaluation quality close to GPT-4o at a fraction of the size, balancing cost-effectiveness and evaluation quality.


Section 02

Research Background: Pain Points in Chart Understanding Evaluation

Chart understanding is a key challenge for Large Vision-Language Models (LVLMs), requiring accurate data extraction and trend comprehension. Existing evaluations rely on manual annotation or closed-source large models (e.g., GPT-4), both of which are costly. This project explores using open-source LVLMs as 'judges' for chart understanding tasks, constructing an evaluation framework and releasing the ChartJudge-2B model.


Section 03

Core Methodology: Detailed Explanation of the LVLM-as-a-Judge Evaluation Framework

Multi-Dimensional Evaluation Modes

  • Pairwise Evaluation: select the better of two candidate answers
  • Single-Point Scoring: rate a single answer on a 1-5 Likert scale
  • With/Without Reference Evaluation: a reference (standard) answer can optionally be provided
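The modes above can be sketched as a single prompt builder. This is a minimal illustration; the function name and prompt wording are assumptions, not the project's actual templates.

```python
# Illustrative judge-prompt builder covering pairwise vs. single-point (Likert)
# modes, with an optional reference answer. The chart image would be passed to
# the LVLM separately; only the text prompt is built here.

def build_judge_prompt(question, answer_a, answer_b=None, reference=None):
    """Build an LVLM-judge text prompt for one chart question."""
    lines = [f"Question about the chart: {question}"]
    if reference is not None:  # with-reference evaluation
        lines.append(f"Reference answer: {reference}")
    if answer_b is not None:  # pairwise mode: pick the better candidate
        lines += [
            f"Answer A: {answer_a}",
            f"Answer B: {answer_b}",
            "Which answer is better? Reply with 'A' or 'B'.",
        ]
    else:  # single-point mode: 1-5 Likert scale
        lines += [
            f"Answer: {answer_a}",
            "Rate the answer on a 1-5 Likert scale. Reply with a single digit.",
        ]
    return "\n".join(lines)


pairwise = build_judge_prompt("Which year peaked?", "2019", "2020", reference="2019")
pointwise = build_judge_prompt("Which year peaked?", "2019")
```

The same builder covers all three mode switches (pairwise vs. single-point, with vs. without reference) through its optional arguments.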

Multi-Criteria Evaluation Dimensions

  • Factual Correctness: Data consistency with the chart
  • Information Richness: Sufficiency of the answer's information
  • Relevance: Alignment with the question
  • Multi-Dimensional Comprehensive Quality: an overall assessment combining all of the above
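A multi-criteria judge response can be parsed into per-criterion scores. Below is a minimal sketch assuming a "Criterion: score" line format; both that format and the shortened "Overall Quality" label are assumptions for illustration, not the project's actual schema.

```python
# Parse a multi-criteria judge response into per-criterion 1-5 scores.
import re

# Criterion names roughly following the four dimensions above (labels assumed).
CRITERIA = [
    "Factual Correctness",
    "Information Richness",
    "Relevance",
    "Overall Quality",
]

def parse_multi_criteria(response: str) -> dict:
    """Extract 'Criterion: <1-5>' lines from a judge response."""
    scores = {}
    for name in CRITERIA:
        m = re.search(rf"{re.escape(name)}\s*:\s*([1-5])", response)
        if m:
            scores[name] = int(m.group(1))
    return scores


sample = (
    "Factual Correctness: 4\n"
    "Information Richness: 3\n"
    "Relevance: 5\n"
    "Overall Quality: 4"
)
scores = parse_multi_criteria(sample)
```

Missing or malformed criteria simply drop out of the result, which makes it easy to detect when a judge fails to follow the multi-criteria format.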

Large-Scale Benchmark Testing

Over 100,000 judgment annotations were conducted on the OpenCQA and VisText datasets. Using GPT-4o and LLaVA-Critic-70B as references, 13 open-source LVLMs (2B-9B parameters) were evaluated.
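Consistency with a reference judge on pairwise verdicts reduces to simple agreement: the fraction of items where both judges pick the same winner. A sketch of that metric, assuming verdicts are recorded as 'A'/'B' choices (an assumed data format):

```python
# Agreement rate between an open-source judge and a reference judge
# (e.g., GPT-4o or LLaVA-Critic-70B) on pairwise verdicts.

def agreement_rate(judge_choices, reference_choices):
    """Fraction of pairwise items where both judges pick the same answer."""
    if len(judge_choices) != len(reference_choices):
        raise ValueError("choice lists must be the same length")
    matches = sum(a == b for a, b in zip(judge_choices, reference_choices))
    return matches / len(judge_choices)


# 4 agreements out of 5 items -> 0.8
rate = agreement_rate(["A", "B", "A", "A", "B"], ["A", "B", "B", "A", "B"])
```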


Section 04

ChartJudge-2B: A Compact Judge Model with Strong Capabilities

Performance

Model                        OpenCQA (Pairwise ↑)   VisText L1 (Pairwise ↑)   VisText L2/L3 (Pairwise ↑)
Qwen2-VL-2B (Base Version)   54.0%                  27.2%                     3.0%
ChartJudge-2B                61.7%                  64.6%                     52.3%
LLaVA-Critic-7B              79.5%                  79.1%                     77.1%
ChartJudge-2B improves dramatically over its base model on every split and, despite its size, substantially narrows the gap with LLaVA-Critic-7B.

Robustness to Multi-Criteria Prompts

Under multi-criteria prompts, the accuracy of 7B models (e.g., LLaVA-Critic) plummets to nearly 0%, while ChartJudge-2B maintains an accuracy of 46.86%.

Deployment Advantages

  • Speed: roughly 2x faster inference than 7B judge models
  • Cost: roughly half the operational cost of 7B judges
  • Hardware: runs on GPUs with 8 GB of VRAM (e.g., T4)
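The VRAM claim is easy to sanity-check with back-of-the-envelope arithmetic: in fp16, model weights cost about 2 bytes per parameter, so a 2B-parameter judge needs under 4 GB for weights, leaving headroom for activations and the KV cache.

```python
# Rough weight-memory estimate for a 2B-parameter model served in fp16.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1024**3 bytes); fp16 = 2 bytes."""
    return n_params * bytes_per_param / 1024**3


fp16_gb = weight_memory_gb(2e9)  # ~3.7 GB: weights fit well within 8 GB of VRAM
```

This is only the weight footprint; actual serving also needs memory for activations and the KV cache, but the remaining budget on an 8 GB card is ample for a 2B model.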

Section 05

Key Findings: Evaluation Potential and Limitations of Open-Source Models

  • Potential of open-source models: Some 7B open-source LVLMs have chart evaluation capabilities close to GPT-4o (about 80% consistency), making them suitable for privacy-sensitive scenarios.
  • Limitations of specialized models: Chart-specific models like ChartGemma and PaliGemma have 0% accuracy when used as judges, indicating that specialized understanding ability ≠ general evaluation ability.
  • Double-edged sword of multi-criteria prompts: they provide richer evaluation dimensions but expose model vulnerabilities; 7B models collapse to near-zero accuracy under them.
  • Cross-model generalization: ChartJudge-2B was trained using Gemini-1.5-Pro as a reference, but remains stable when evaluated with GPT-4o/LLaVA-Critic-70B.
  • Correlation with human judgment: LLaVA-Critic-70B tracks human judgment more closely (mean error distance 0.81, lower is better) than GPT-4o (0.93).
  • Prevalent biases: All judge models exhibit position bias and length bias.
  • Power of fine-tuning: After fine-tuning, PaliGemma-3B's VisText pairwise accuracy increased from 0% to 55.9%.
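The position bias noted above can be probed by judging each pair twice with the candidate order swapped and measuring how often the verdict stays consistent. The judges below are toy stand-ins for an actual LVLM call, used only to illustrate the probe.

```python
# Probe for position bias: a consistent judge should pick the same underlying
# answer regardless of whether it appears in slot A or slot B.

def position_consistency(judge, pairs):
    """Fraction of pairs judged consistently when candidate order is swapped."""
    consistent = 0
    for a, b in pairs:
        first = judge(a, b)      # 'A' means the first candidate wins
        swapped = judge(b, a)
        # Consistent iff the same underlying answer wins in both orderings.
        if (first == "A") == (swapped == "B"):
            consistent += 1
    return consistent / len(pairs)


def always_first(a, b):
    """Maximally position-biased judge: always prefers the first candidate."""
    return "A"

def prefers_longer(a, b):
    """Content-based judge: prefers the longer answer regardless of position."""
    return "A" if len(a) >= len(b) else "B"


pairs = [("short", "a much longer answer"), ("x", "yy")]
biased_rate = position_consistency(always_first, pairs)   # 0.0: verdict flips meaning on swap
fair_rate = position_consistency(prefers_longer, pairs)   # 1.0: same answer wins either way
```

A similar swap test with padded answers of equal content would expose length bias.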

Section 06

Application Value: Reducing Costs and Promoting Evaluation Standardization

  • Cost reduction: Replaces GPT-4o, providing an economical solution for large-scale evaluations.
  • Privacy scenarios: Local deployment of open-source models is suitable for enterprises that cannot use external APIs.
  • Evaluation standardization: Proposes pairwise/single-point scoring, multi-dimensional evaluation paradigms, and metrics, providing references for domain standardization.
  • Revealing capability boundaries: By comparing 13 open-source LVLMs, it reveals their vulnerabilities under multi-criteria prompts and points out directions for improvement.

Section 07

Open-Source Resources: Full Access to Code, Models, and Data

The project's open-source content includes:

  • Complete implementation of the evaluation framework
  • ChartJudge-2B model weights
  • Training dataset (~9.7K single-criteria + ~2.8K multi-criteria)
  • Evaluation scripts and benchmark testing code
  • Experiment configurations and hyperparameters

Chart image data can be downloaded via the project's Google Drive link.