Perceptual Judgment Bias in Multimodal Large Model Evaluation: Problem Identification and Solutions

This article introduces a study of the perceptual judgment bias that Multimodal Large Language Models (MLLMs) exhibit when used as automatic evaluators, and of the methods the study proposes to mitigate this bias through perceptual perturbation and reward modeling.

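Tags: Multimodal Large Language Models, MLLM, Automatic Evaluators, Perceptual Judgment Bias, Vision-Language Models, Reinforcement Learning, GRPO, Model Evaluation, Machine Learning, Artificial Intelligence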

Published 2026-06-02 01:59 | Recent activity 2026-06-02 13:18 | Estimated read 6 min

Section 01

[Introduction] Perceptual Judgment Bias in Multimodal Large Model Evaluation and Its Solutions

Key Takeaways

This study focuses on the perceptual judgment bias of Multimodal Large Language Models (MLLMs) when acting as automatic evaluators:

Problem: MLLM evaluators are easily swayed by fluent text and fail to check answers against the visual content, which makes their evaluations inconsistent and unverifiable;
Solution: The study proposes a method for constructing the Perceptual Perturbation Judgment Dataset (PPJ Dataset), combined with a training framework that uses GRPO reinforcement learning and a batch ranking objective;
Effect: The study reports significant improvements in the evaluator's perceptual fidelity, ranking consistency, and alignment with human evaluations.

Section 02

Research Background and Definition of Perceptual Judgment Bias

Background

In recent years, MLLMs have become more capable at vision-language tasks and are being explored as automatic evaluators that assess the quality of answers produced by other models.

Perceptual Judgment Bias

When visual evidence conflicts with textual cues, MLLM evaluators tend to reward answers that sound reasonable but are inconsistent with the image. In essence, the evaluator is swayed by the surface plausibility of the text and skips visual verification.

Section 03

Innovative Dataset: Construction of the Perceptual Perturbation Judgment Dataset (PPJ Dataset)

Construction Idea

Starting from correct image-text pairs, targeted modifications are applied to the images to generate "counterfactual answers" (textually plausible but visually incorrect). The result is a set of paired samples: a perceptually correct answer versus a textually plausible but incorrect one.

Advantages

Provides verifiable supervision signals: correctness is based on objective image facts rather than subjective judgment, improving the interpretability and reliability of the evaluator.
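
The paired structure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PPJPair` fields, the `judge` callable signature, and the `pair_accuracy` metric are hypothetical names chosen here, not the schema or metric used in the study.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PPJPair:
    """One paired sample (hypothetical schema, not the study's format)."""
    image_id: str               # reference to the image being judged
    question: str               # prompt shown with the image
    grounded_answer: str        # answer consistent with the image
    counterfactual_answer: str  # fluent answer that contradicts the image


# Assumed judge interface: (image_id, question, answer) -> numeric score.
Judge = Callable[[str, str, str], float]


def pair_accuracy(pairs: Sequence[PPJPair], judge: Judge) -> float:
    """Fraction of pairs where the grounded answer scores strictly higher."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    wins = sum(
        judge(p.image_id, p.question, p.grounded_answer)
        > judge(p.image_id, p.question, p.counterfactual_answer)
        for p in pairs
    )
    return wins / len(pairs)
```

Because each pair has a known correct side, a judge that rewards fluency or length over visual correctness shows up directly as a low pair accuracy.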

Section 04

Unified Training Framework: Synergistic Effect of GRPO and Batch Ranking

GRPO Structured Reward

The framework uses the Group Relative Policy Optimization (GRPO) reinforcement learning algorithm, which optimizes the policy by comparing the relative quality of candidate answers within a group, guiding the model to focus on visual authenticity.
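
The "relative quality" comparison in GRPO is commonly implemented by normalizing each sampled output's reward against the other outputs in its group. The sketch below shows only that normalization step. The function name is hypothetical, the use of the population standard deviation is an assumption (implementations differ), and the structured reward the study actually uses is not specified in this article.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    `rewards` holds one scalar reward per sampled output for the same prompt.
    `eps` avoids division by zero when every output gets the same reward.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]
```

Outputs that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.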

Batch Ranking Objective

Without requiring pairwise labels, the objective learns a globally consistent scoring function from batches of samples, which improves ranking consistency.
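
The article does not give the form of the batch ranking objective. As a stand-in, the sketch below uses a ListNet-style listwise loss, which compares the score distribution over a whole batch with a target distribution instead of using pairwise labels. The function names and the choice of loss are assumptions made for illustration, not the study's objective.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_ranking_loss(pred_scores: Sequence[float],
                          target_scores: Sequence[float]) -> float:
    """Cross-entropy between softmax(target) and softmax(pred) over one batch."""
    if len(pred_scores) != len(target_scores) or not pred_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    p_target = softmax(target_scores)
    p_pred = softmax(pred_scores)
    return -sum(t * math.log(p) for t, p in zip(p_target, p_pred))
```

The loss is smallest when the predicted scores induce the same distribution as the targets, which pushes a single scoring function toward one consistent ordering across the batch.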

Synergistic Effect

GRPO provides fine-grained discrimination between answers, while batch ranking ensures global consistency. Together they improve the evaluator's performance.

Section 05

Experimental Results: Improved Perceptual Fidelity, Consistency, and Human Alignment

Improved Perceptual Fidelity

The trained evaluator more accurately identifies visual-text inconsistencies and gives low scores to incorrect answers.

Improved Ranking Consistency

When the same set of answers is presented in different orders, the ranking results remain stable.
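
One way to check this kind of order stability is to re-rank every permutation of the same answers and count how often the result matches a reference ranking. This is a minimal sketch, not the study's evaluation protocol: `rank_fn` is a hypothetical callable that takes the answers in presentation order and returns them ordered best to worst.

```python
from itertools import permutations
from typing import Callable, List, Sequence

# Assumed interface: answers in presentation order -> answers best to worst.
RankFn = Callable[[List[str]], List[str]]


def ranking_consistency(answers: Sequence[str], rank_fn: RankFn) -> float:
    """Fraction of input orderings that yield the same ranking as the original."""
    if not answers:
        raise ValueError("answers must not be empty")
    reference = tuple(rank_fn(list(answers)))
    orderings = list(permutations(answers))
    stable = sum(tuple(rank_fn(list(o))) == reference for o in orderings)
    return stable / len(orderings)
```

The number of permutations grows factorially, so for more than a handful of answers it is more practical to sample random orderings than to enumerate them all.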

Improved Human Alignment

The study reports a significantly higher correlation with human expert scores.
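
The article does not say which correlation measure is used. Spearman rank correlation is a common choice when comparing evaluator scores with human ratings, so the sketch below assumes it. The function names are hypothetical, and ties receive average ranks.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(judge_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(judge_scores), average_ranks(human_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

A value near 1 means the evaluator orders answers the way human experts do, while a value near 0 means the two orderings are unrelated.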

Section 06

Practical Significance: Improving Automatic Evaluation Reliability and Reducing Annotation Costs

Automatic Evaluation Reliability: Improves the credibility of MLLM-as-a-Judge results, aiding model selection and monitoring;
Reduced Annotation Costs: Efficiently generates training data through perceptual perturbation;
Enhanced Interpretability: Decisions can be traced back to specific visual-text inconsistencies;
Robust System Construction: Provides a scalable approach to resolving perception-reasoning conflicts.

Section 07

Conclusion and Outlook: Direction of Perceptually Grounded Multimodal Evaluators

Conclusion

Through systematic problem identification, a new dataset, and a unified training framework, the study mitigates perceptual judgment bias and improves evaluator performance.

Outlook

The approach can be extended to more complex scenarios such as video understanding and multi-image reasoning, and to application fields such as autonomous driving and medical image diagnosis, helping make multimodal AI systems more perceptually grounded and reliable.
