Zing Forum

Perceptual Judgment Bias in Multimodal Large Model Evaluation: Problem Identification and Solutions

This article introduces a study on the perceptual judgment bias of Multimodal Large Language Models (MLLMs) when used as automatic evaluators, and proposes methods to mitigate this bias through perceptual perturbation and reward modeling.

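Tags: Multimodal Large Language Models, MLLM, Automatic Evaluators, Perceptual Judgment Bias, Vision-Language Models, Reinforcement Learning, GRPO, Model Evaluation, Machine Learning, Artificial Intelligence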
Published 2026-06-02 01:59 | Recent activity 2026-06-02 13:18 | Estimated read 6 min

Section 01

[Introduction] Perceptual Judgment Bias in Multimodal Large Model Evaluation and Its Solutions

Key Takeaways

This study focuses on the perceptual judgment bias of Multimodal Large Language Models (MLLMs) when acting as automatic evaluators:

  1. Problem: MLLM evaluators are easily misled by fluent text and overlook whether an answer actually matches the visual content, which leads to inconsistent and unverifiable evaluations;
  2. Solution: The study proposes a method for constructing the Perceptual Perturbation Judgment Dataset (PPJ Dataset), combined with a training framework that uses GRPO reinforcement learning and a batch ranking objective;
  3. Effect: The study reports improved perceptual fidelity, ranking consistency, and alignment with human evaluations.

Section 02

Research Background and Definition of Perceptual Judgment Bias

Background

In recent years, MLLMs have become more capable at vision-language tasks and are increasingly explored as automatic evaluators that assess the quality of answers produced by other models.

Perceptual Judgment Bias

When visual evidence conflicts with textual cues, MLLM evaluators tend to reward answers that "sound reasonable but are inconsistent with the image". In other words, the evaluator is swayed by the surface plausibility of the text and skips visual verification.

Section 03

Innovative Dataset: Construction of the Perceptual Perturbation Judgment Dataset (PPJ Dataset)

Construction Idea

Starting from correct image-text pairs, targeted modifications are made to the images to produce "counterfactual answers": answers that read plausibly but are visually incorrect. Each sample then pairs a perceptually correct answer with a textually plausible but incorrect one.
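The pairing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's actual pipeline: the image is stood in for by a dictionary of visual facts, and `JudgmentPair`, `build_pair`, and the answer template are hypothetical names introduced here. One fact is perturbed, so the answer describing the original image becomes the fluent but visually wrong candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgmentPair:
    """One paired sample: two candidate answers for the same (perturbed) image."""
    image_facts: dict            # hypothetical stand-in for the perturbed image
    question: str
    correct_answer: str          # consistent with image_facts
    counterfactual_answer: str   # fluent, but contradicted by image_facts
    perturbed_key: str           # which visual fact was changed


def build_pair(image_facts, question, template, key, new_value):
    """Perturb one visual fact and pair the now-wrong answer with the correct one."""
    if key not in image_facts:
        raise KeyError(key)
    if image_facts[key] == new_value:
        raise ValueError("the perturbation must change the fact")
    perturbed = dict(image_facts)
    perturbed[key] = new_value
    return JudgmentPair(
        image_facts=perturbed,
        question=question,
        correct_answer=template.format(**perturbed),
        counterfactual_answer=template.format(**image_facts),
        perturbed_key=key,
    )


pair = build_pair(
    {"color": "red", "object": "car"},
    "What is in the image?",
    "A {color} {object} is parked on the street.",
    key="color",
    new_value="blue",
)
print(pair.correct_answer)
print(pair.counterfactual_answer)
```

Because the label follows directly from which fact was changed, the correctness of each answer can be checked mechanically rather than by opinion.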

Advantages

The paired samples provide verifiable supervision signals: correctness is determined by objective facts in the image rather than by subjective judgment, which improves the interpretability and reliability of the evaluator.

Section 04

Unified Training Framework: Synergistic Effect of GRPO and Batch Ranking

GRPO Structured Reward

The framework uses Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that optimizes the policy by comparing the relative quality of candidate answers within a group, steering the model to attend to visual faithfulness.
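The group-relative comparison at the core of GRPO can be illustrated with a short sketch. It shows only the commonly described advantage normalization, where each reward is centered and scaled by the statistics of its own group. The reward values below are hypothetical, and the structured reward used in the study is not specified in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled judgments of the same item:
# 1.0 if the judgment favored the visually correct answer, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Judgments that beat their group average receive a positive advantage and are reinforced, while those below it are discouraged, so no separate value model is needed for the baseline.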

Batch Ranking Objective

The batch ranking objective learns a globally consistent scoring function from batches of samples without requiring pairwise labels, which improves ranking consistency.
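The summary does not give the exact form of the batch ranking objective. As one plausible illustration, the sketch below uses a listwise softmax cross-entropy over a batch (in the style of ListNet), which is an assumption and not the study's confirmed loss. The function names and the scores are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(pred_scores, target_scores):
    """Cross-entropy between the target and predicted score distributions of one batch."""
    if len(pred_scores) != len(target_scores):
        raise ValueError("score lists must have the same length")
    p = softmax(target_scores)
    q = softmax(pred_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# The loss is lower when predicted scores order the batch like the targets.
aligned = listwise_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
reversed_order = listwise_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])
print(aligned < reversed_order)
```

A listwise loss of this kind scores every answer in the batch on one shared scale, which is what makes the resulting ranking globally consistent rather than only pairwise correct.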

Synergistic Effect

GRPO provides fine-grained discrimination between candidates, while batch ranking enforces global consistency. Together they improve the evaluator's performance.

Section 05

Experimental Results: Improved Perceptual Fidelity, Consistency, and Human Alignment

Improved Perceptual Fidelity

The trained evaluator more accurately identifies image-text inconsistencies and assigns low scores to visually incorrect answers.
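One simple way to quantify this kind of fidelity on paired samples is pairwise accuracy: the fraction of pairs in which the evaluator scores the visually correct answer strictly higher than the counterfactual. The metric name and the scores below are illustrative assumptions, since the summary does not state which metric the study reports.

```python
def pairwise_accuracy(score_pairs):
    """score_pairs: list of (score_for_correct, score_for_counterfactual)."""
    if not score_pairs:
        raise ValueError("score_pairs must be non-empty")
    wins = sum(1 for correct, counterfactual in score_pairs if correct > counterfactual)
    return wins / len(score_pairs)


# Hypothetical evaluator scores for four pairs; ties count as failures.
scores = [(0.9, 0.2), (0.4, 0.7), (0.8, 0.8), (0.6, 0.1)]
print(pairwise_accuracy(scores))
```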

Improved Ranking Consistency

Rankings remain stable when the same set of answers is presented in different orders.
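Order stability can be checked by presenting every permutation of a small answer set and counting how often the resulting ranking matches the ranking obtained from the original order. The sketch below assumes a hypothetical `judge` callable that takes a list of answers and returns one score per answer, and it assumes the scores are distinct. Both toy judges are illustrations, not models from the study.

```python
from itertools import permutations


def rank(judge, answers):
    """Return the answers sorted from highest to lowest judge score."""
    scores = judge(answers)
    order = sorted(range(len(answers)), key=lambda i: -scores[i])
    return tuple(answers[i] for i in order)


def permutation_consistency(judge, answers):
    """Fraction of presentation orders that reproduce the reference ranking."""
    reference = rank(judge, list(answers))
    perms = list(permutations(answers))
    agree = sum(1 for p in perms if rank(judge, list(p)) == reference)
    return agree / len(perms)


def content_judge(answers):
    # Toy judge that scores by content only (here: answer length).
    return [len(a) for a in answers]


def position_biased_judge(answers):
    # Toy judge that always prefers whichever answer is shown first.
    return [len(answers) - i for i in range(len(answers))]


answers = ["a", "bb", "ccc"]
print(permutation_consistency(content_judge, answers))
print(permutation_consistency(position_biased_judge, answers))
```

Enumerating all permutations is only practical for small answer sets. For larger sets, sampling a fixed number of random orders would be the natural substitute.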

Improved Human Alignment

The evaluator's scores correlate significantly more strongly with human expert scores.
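Agreement with human scores is often summarized with a rank correlation. The sketch below computes Spearman's coefficient in plain Python, using average ranks for ties. Whether the study uses Spearman, Pearson, or another statistic is not stated in this summary, so the choice here is an assumption, and the sample scores are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical evaluator scores and human expert scores for four answers.
evaluator_scores = [0.2, 0.5, 0.7, 0.9]
human_scores = [1, 2, 4, 5]
print(spearman(evaluator_scores, human_scores))
```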

Section 06

Practical Significance: Improving Automatic Evaluation Reliability and Reducing Annotation Costs

  1. Automatic Evaluation Reliability: Makes MLLM-as-a-Judge results more credible, which supports model selection and monitoring;
  2. Reduced Annotation Costs: Perceptual perturbation generates training data efficiently;
  3. Enhanced Interpretability: Decisions can be traced back to specific image-text inconsistencies;
  4. Robust System Construction: Provides a scalable approach to resolving conflicts between perception and reasoning.

Section 07

Conclusion and Outlook: Direction of Perceptually Grounded Multimodal Evaluators

Conclusion

By systematically identifying the problem, constructing the PPJ Dataset, and training with GRPO and batch ranking, the study mitigates perceptual judgment bias and improves evaluator performance.

Outlook

The approach could be extended to more complex scenarios such as video understanding and multi-image reasoning, and to application fields like autonomous driving and medical image diagnosis, helping multimodal AI systems become more perceptually grounded and reliable.