Zing Forum

Perception-Judge: Eliminating Perceptual Judgment Bias in Multimodal LLMs via Perceptual Perturbation and Reward Modeling

The KAIST research team proposes Perception-Judge, a framework that mitigates the perceptual judgment bias multimodal large language models (MLLMs) show when acting as judges. It combines the Perceptual Perturbation Dataset (PPJD) with GRPO reinforcement learning.

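Tags: Multimodal LLMs, MLLM-as-a-Judge, Perceptual Judgment Bias, GRPO, Reinforcement Learning, PPJD Dataset, ICML 2026, Vision-Language Models, Automatic Evaluation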
Published 2026-06-16 17:16 | Recent activity 2026-06-16 17:21 | Estimated read 6 min

Section 01

Introduction: The Perception-Judge Framework Addresses Perceptual Judgment Bias in Multimodal LLM Judges

The KAIST research team proposes Perception-Judge, a framework that mitigates the perceptual judgment bias of multimodal LLMs acting as judges. It does so by constructing the Perceptual Perturbation Dataset (PPJD) and training the judge with GRPO reinforcement learning and a batch ranking reward. The framework improves the perceptual fidelity, ranking consistency, and human alignment of the judge's decisions. The team has open-sourced the dataset, models, and code.

Section 02

Research Background: Perceptual Judgment Bias in Multimodal LLM Judges

Multimodal LLMs perform well on tasks such as visual understanding, but they exhibit perceptual judgment bias when used as automated judges: when visual evidence conflicts with textual cues, they tend to reward a plausible-sounding textual narrative over the answer that is actually supported by the image. As a result, evaluations over-rely on textual fluency and neglect what the image shows. For example, a fluent image description that contradicts the image content can still receive a high score.

Section 03

Solution: PPJD Dataset and GRPO Training Framework

PPJD Dataset

PPJD is built on annotated data from MMPR v1.2. For each sample it generates variant images that differ only slightly in appearance but in a semantically decisive way, while the textual response is kept unchanged. Because the text is fixed, any change in the correct judgment must come from the image, which isolates perceptual errors and provides a supervision signal. The dataset contains approximately 3,000 training samples and has been released on Hugging Face.
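
To make the idea concrete, a PPJD-style sample can be pictured as two images sharing one textual response. The sketch below is only an illustration under assumptions: the `PerturbationPair` class, its field names, and the `fidelity_rate` helper are hypothetical and are not taken from the released dataset schema, and the judge is a stand-in callable rather than a real model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


# Hypothetical record layout; the released PPJD schema may differ.
@dataclass(frozen=True)
class PerturbationPair:
    original_image: str   # image that the response describes correctly
    perturbed_image: str  # minimally edited image that contradicts the response
    question: str
    response: str         # identical text shown with both images


# A judge maps (image, question, response) to a scalar score.
Judge = Callable[[str, str, str], float]


def fidelity_rate(judge: Judge, pairs: Sequence[PerturbationPair]) -> float:
    """Fraction of pairs where the judge scores the response strictly
    higher on the original image than on the perturbed image."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(
        judge(p.original_image, p.question, p.response)
        > judge(p.perturbed_image, p.question, p.response)
        for p in pairs
    )
    return hits / len(pairs)


def text_only_judge(image: str, question: str, response: str) -> float:
    # Toy stand-in for a biased judge: it ignores the image entirely
    # and rewards longer responses.
    return float(len(response.split()))


if __name__ == "__main__":
    demo = [PerturbationPair("cat.png", "cat_edit.png",
                             "What color is the cat?", "The cat is black.")]
    print(fidelity_rate(text_only_judge, demo))  # prints 0.0
```

A judge that never looks at the image gives both images the same score, so its rate is 0.0. A perceptually grounded judge should score the unchanged response lower on the perturbed image.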

GRPO Training Framework

Training fine-tunes the judge with Group Relative Policy Optimization (GRPO) combined with a batch ranking reward objective. Both full-parameter fine-tuning and LoRA are supported. The implementation is built on the verl project, and checkpoints at several scales have been released (e.g., Qwen3-4B and a LoRA version of Flex-VL-32B).
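
The core of GRPO is that each sampled output is scored relative to the other outputs in its own group, without a separate value model. The sketch below shows that normalization in plain Python, together with one possible batch ranking reward. The reward is an assumption made for illustration (fraction of item pairs ordered the same way as a reference ranking), since the exact reward used by Perception-Judge is not specified here. The function names and the use of the population standard deviation are also choices of this sketch.

```python
from statistics import mean, pstdev
from typing import Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def batch_ranking_reward(predicted: Sequence[str],
                         reference: Sequence[str]) -> float:
    """Hypothetical reward: fraction of item pairs that the predicted
    ranking orders the same way as the reference ranking."""
    if len(set(reference)) != len(reference):
        raise ValueError("reference ranking must not contain duplicates")
    if sorted(predicted) != sorted(reference):
        return 0.0  # malformed output: missing, extra, or repeated items
    n = len(reference)
    if n < 2:
        return 1.0
    position = {item: i for i, item in enumerate(predicted)}
    concordant = 0
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            if position[reference[i]] < position[reference[j]]:
                concordant += 1
    return concordant / total


if __name__ == "__main__":
    reference = ["A", "B", "C"]
    # Three rankings sampled from the judge for the same batch.
    group = [["A", "B", "C"], ["A", "C", "B"], ["C", "B", "A"]]
    rewards = [batch_ranking_reward(g, reference) for g in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Outputs that rank the batch better than their group average receive a positive advantage and the rest a negative one, so the update direction depends only on relative quality within the group.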

Section 04

Experimental Evidence: Performance Improvement of the Perception-Judge Framework

On the MLLM-Judge benchmark, the framework is reported to improve three aspects of judging:

  • Perceptual Fidelity: More accurately identifies visual-text mismatches and reduces the incidence of bias;
  • Ranking Consistency: Batch ranking rewards improve global ranking consistency;
  • Human Alignment: Judgments agree more closely with those of human experts.

The authors take these results as evidence of the framework's effectiveness and generality.

Section 05

Technical Implementation and Open-Source Resources

The project is fully open-source and provides:

  • Code Repository: Training, data preparation, and evaluation scripts (including GRPO training, PPJD construction, MLLM-Judge evaluation);
  • Pre-trained Models: Multi-scale models released on Hugging Face;
  • Dataset: PPJD training and validation sets;
  • Project Page: Visual demos and technical documentation.

The recommended environment is Python 3.10 with CUDA-capable GPUs. Training on 8 GPUs is supported, and a Docker image is provided to avoid dependency issues.

Section 06

Research Significance and Future Outlook

  • Theoretical Significance: The work is presented as the first to systematically define and quantify the perceptual judgment bias of MLLM-as-a-Judge, providing a problem framework and evaluation benchmarks.
  • Practical Significance: It provides a complete, open solution (dataset, training code, and models) that lowers the barrier to further research.
  • Future Outlook: The approach could influence fields such as multimodal content moderation, generative AI evaluation, and human-machine collaboration systems.

Section 07

Conclusion: Academic and Application Value of Perception-Judge

Perception-Judge is a notable step forward for multimodal LLM judges. It mitigates perceptual bias by combining the PPJD dataset with GRPO training and a batch ranking reward, with the aim of producing judges that are more perceptually grounded, interpretable, and robust. The work has both academic value and a practical path to application, and the open-source resources should help the community build on it.