# Perceptual Judgment Bias in Multimodal Large Model Evaluation: Problem Identification and Solutions

> This article introduces a study on the perceptual judgment bias of Multimodal Large Language Models (MLLMs) when used as automatic evaluators, and proposes methods to mitigate this bias through perceptual perturbation and reward modeling.

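- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-01T17:59:46.000Z
- Last activity: 2026-06-02T05:18:13.768Z
- Popularity: 143.7
- Keywords: multimodal large language models, MLLM, automatic evaluator, perceptual judgment bias, vision-language models, reinforcement learning, GRPO, model evaluation, machine learning, artificial intelligence
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2606-02578v1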
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2606-02578v1

---

## Introduction: Perceptual Judgment Bias in Multimodal Large Model Evaluation and Its Solutions

### Key Takeaways
This study examines the **perceptual judgment bias** that Multimodal Large Language Models (MLLMs) exhibit when they act as automatic evaluators:
1. Problem: MLLM evaluators are easily misled by fluent text and fail to check whether an answer matches the visual content, which leads to inconsistent and unverifiable evaluations;
2. Solution: The study proposes a method for constructing the **Perceptual Perturbation Judgment Dataset (PPJ Dataset)**, together with a training framework that combines **GRPO reinforcement learning** with a **batch ranking objective**;
3. Effect: The study reports significant improvements in the evaluator's perceptual fidelity, ranking consistency, and alignment with human evaluations.

## Research Background and Definition of Perceptual Judgment Bias

### Background
MLLMs have become more capable on vision-language tasks in recent years and are now being explored as automatic evaluators that assess the quality of other models' answers.
### Perceptual Judgment Bias
When visual evidence conflicts with textual cues, MLLM evaluators tend to reward answers that sound reasonable but are inconsistent with the image. In effect, the evaluator is swayed by the surface plausibility of the text and skips visual verification.

## Innovative Dataset: Construction of the Perceptual Perturbation Judgment Dataset (PPJ Dataset)

### Construction Idea
Starting from correct image-text pairs, the method makes targeted modifications to the images to produce "counterfactual answers" that are textually plausible but visually incorrect. Each sample then pairs a perceptually correct answer with a textually plausible but incorrect one.
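As a rough illustration of what one such paired sample might look like in code, the sketch below defines a small record type and expands it into two labeled judge inputs. The class name `PPJPair`, its field names, and the helper `to_judgment_examples` are hypothetical; the source does not describe the dataset schema, so this is only one plausible reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPJPair:
    """One paired sample (hypothetical schema, not taken from the source)."""
    image_id: str               # identifier of the perturbed image
    question: str               # prompt shown to the answering model
    grounded_answer: str        # answer consistent with the image
    counterfactual_answer: str  # fluent answer contradicted by the image
    perturbation: str           # short description of the targeted image edit


def to_judgment_examples(pair: PPJPair) -> list[dict]:
    """Expand a pair into two judge inputs with verifiable labels (1 = correct)."""
    shared = {"image_id": pair.image_id, "question": pair.question}
    return [
        {**shared, "answer": pair.grounded_answer, "label": 1},
        {**shared, "answer": pair.counterfactual_answer, "label": 0},
    ]


# Illustrative values only.
pair = PPJPair(
    image_id="img_0001_perturbed",
    question="What color is the car?",
    grounded_answer="The car is blue.",
    counterfactual_answer="The car is red.",
    perturbation="car color changed from red to blue",
)
examples = to_judgment_examples(pair)
```

Because the label follows from the image edit itself, each example carries a ground truth that can be checked against the image.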
### Advantages
The dataset provides **verifiable supervision signals**: correctness is determined by objective image facts rather than subjective judgment, which improves the interpretability and reliability of the evaluator.

## Unified Training Framework: Synergistic Effect of GRPO and Batch Ranking

### GRPO Structured Reward
The framework uses the Group Relative Policy Optimization (GRPO) reinforcement learning algorithm, which optimizes the policy by comparing the relative quality of candidate answers and guides the model to focus on visual correctness.
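The core of the group-relative comparison in GRPO can be sketched as follows: each reward in a group of sampled outputs is normalized by the group's mean and standard deviation to give an advantage. This is a minimal sketch of that step only. The reward values are made up, the function name is hypothetical, and the source does not describe how its structured reward is computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four outputs sampled for the same input.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that score above the group average get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.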
### Batch Ranking Objective
This objective does not require pairwise labels. It learns a globally consistent scoring function over the samples in a batch, which improves ranking consistency.
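The source does not specify the form of the batch ranking loss. As a stand-in, the sketch below uses a ListMLE-style listwise loss: the negative log-likelihood of a target ordering under a Plackett-Luce model of the scores. The function name and the example scores are assumptions for illustration, not the study's objective.

```python
import math


def listmle_loss(scores: list[float], target_order: list[int]) -> float:
    """Negative log-likelihood of target_order (best to worst) given scores.

    A stand-in listwise objective; not confirmed to be the one used in the study.
    """
    loss = 0.0
    for k, idx in enumerate(target_order):
        remaining = [scores[i] for i in target_order[k:]]
        m = max(remaining)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        loss += log_z - scores[idx]
    return loss


# Made-up scores for three answers in one batch.
scores = [2.0, 1.0, 0.0]
loss_matching = listmle_loss(scores, [0, 1, 2])  # order agrees with the scores
loss_reversed = listmle_loss(scores, [2, 1, 0])  # order contradicts the scores
```

A loss of this kind scores the whole batch at once, so every sample is placed on a single shared scale rather than being compared two at a time.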
### Synergistic Effect
GRPO provides fine-grained discrimination between answers, while batch ranking enforces global consistency; together they improve the evaluator's performance.

## Experimental Results: Improved Perceptual Fidelity, Consistency, and Human Alignment

### Improved Perceptual Fidelity
The trained evaluator identifies image-text inconsistencies more accurately and assigns low scores to incorrect answers.
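One simple way to quantify this property is the fraction of pairs in which the judge scores the visually correct answer above its counterfactual. The source does not state which metric it reports, so the function below and its input format are assumptions.

```python
def pairwise_fidelity(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (grounded, counterfactual) score pairs ranked correctly.

    Ties count as failures, since the judge did not prefer the grounded answer.
    """
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    wins = sum(1 for grounded, counterfactual in score_pairs if grounded > counterfactual)
    return wins / len(score_pairs)


# Made-up judge scores for four pairs.
fidelity = pairwise_fidelity([(0.9, 0.2), (0.4, 0.7), (0.8, 0.1), (0.5, 0.5)])
```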
### Improved Ranking Consistency
Rankings remain stable when the same set of answers is presented in different orders.
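Stability of this kind could be measured, for example, with Kendall's tau between the rankings a judge produces under two presentation orders. The sketch below assumes no tied scores; the helper names and the example scores are hypothetical, and the source does not say which consistency measure it uses.

```python
from itertools import combinations


def ranks_from_scores(scores: dict[str, float]) -> dict[str, int]:
    """Map each answer id to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def kendall_tau(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Kendall's tau for two rankings of the same items (no ties assumed)."""
    items = list(rank_a)
    if len(items) < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up judge scores for the same answers under two presentation orders.
run_1 = ranks_from_scores({"ans1": 0.9, "ans2": 0.6, "ans3": 0.2})
run_2 = ranks_from_scores({"ans1": 0.7, "ans2": 0.8, "ans3": 0.1})
tau = kendall_tau(run_1, run_2)
```

A value of 1 means the two runs rank the answers identically, and a value of -1 means one ranking is the exact reverse of the other.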
### Improved Human Alignment
The correlation between the evaluator's scores and human expert scores increases significantly.
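Agreement with human scores is commonly summarized with a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, using only the standard library. Whether the study uses Spearman, Pearson, or another measure is not stated in the source, and the scores shown are made up.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up judge scores and human scores for the same five answers.
judge_scores = [0.9, 0.4, 0.7, 0.1, 0.5]
human_scores = [5.0, 2.0, 4.0, 1.0, 3.0]
rho = spearman(judge_scores, human_scores)
```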

## Practical Significance: Improving Automatic Evaluation Reliability and Reducing Annotation Costs

1. **Automatic Evaluation Reliability**: Makes MLLM-as-a-Judge results more credible, which helps with model selection and monitoring;
2. **Reduced Annotation Costs**: Perceptual perturbation generates training data efficiently, reducing the need for manual annotation;
3. **Enhanced Interpretability**: Decisions can be traced back to specific image-text inconsistencies;
4. **Robust System Construction**: Provides a scalable approach to resolving conflicts between perception and reasoning.

## Conclusion and Outlook: Direction of Perceptually Grounded Multimodal Evaluators

### Conclusion
By systematically identifying the problem, constructing a new dataset, and designing a training framework, the study mitigates perceptual judgment bias and improves evaluator performance.
### Outlook
The approach could be extended to more complex scenarios such as video understanding and multi-image reasoning, and to application areas such as autonomous driving and medical image diagnosis, helping multimodal AI systems become more perceptually grounded and reliable.