# Perception-Judge: Eliminating Perceptual Judgment Bias in Multimodal LLMs via Perceptual Perturbation and Reward Modeling

> The KAIST research team proposes the Perception-Judge framework, which effectively mitigates the perceptual judgment bias of multimodal large models when acting as judges through the Perceptual Perturbation Dataset (PPJD) and GRPO reinforcement learning training.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-16T09:16:54.000Z
- Last activity: 2026-06-16T09:21:11.155Z
- Popularity: 150.9
- Keywords: multimodal LLMs, MLLM-as-a-Judge, perceptual judgment bias, GRPO reinforcement learning, PPJD dataset, ICML 2026, vision-language models, automated evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/perception-judge-llm
- Canonical: https://www.zingnex.cn/forum/thread/perception-judge-llm
- Markdown source: floors_fallback

---

## Introduction: The Perception-Judge Framework Addresses Perceptual Judgment Bias in Multimodal LLM Judges

Perception-Judge, proposed by a KAIST research team, targets the perceptual judgment bias that multimodal LLMs show when they act as judges. It combines two pieces: the Perceptual Perturbation Dataset (PPJD), and GRPO reinforcement learning with a batch ranking reward. The authors report improvements in the perceptual fidelity, ranking consistency, and human alignment of the resulting judgments, and they have open-sourced the dataset, models, and code.

## Research Background: Perceptual Judgment Bias in Multimodal LLM Judges

Multimodal LLMs now perform strongly on tasks such as visual understanding, but they exhibit perceptual judgment bias when used as automated judges: when visual evidence conflicts with textual cues, they tend to reward a plausible-sounding textual narrative over the answer that is actually correct given the image. As a result, evaluations over-rely on textual fluency and neglect whether the image content was really understood. For example, a fluent image description that contradicts the image can still receive a high score.

## Solution: PPJD Dataset and GRPO Training Framework

### PPJD Dataset
PPJD is built on annotated data from MMPR v1.2. For each sample, it generates variant images that differ only slightly in appearance but differ in a key semantic detail, while the textual response is kept unchanged. Because only the image changes, any shift in the judge's verdict can be attributed to perception, which isolates perceptual errors and provides a supervision signal. The dataset contains approximately 3,000 training samples and has been released on Hugging Face.
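
The pairing idea can be illustrated with a minimal sketch. The record layout, the field names, and the `perceptual_bias_rate` helper below are hypothetical and are not taken from the released PPJD schema; the sketch only assumes that each record holds one unchanged response and two images, and that the caller supplies a `judge_score` function.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PerturbedPair:
    """Hypothetical PPJD-style record: one response, two images."""
    question: str
    response: str          # textual response, identical for both images
    original_image: str    # image the response correctly describes
    perturbed_image: str   # minimally edited image the response no longer fits


def perceptual_bias_rate(
    pairs: Sequence[PerturbedPair],
    judge_score: Callable[[str, str, str], float],
) -> float:
    """Fraction of pairs where the judge fails to prefer the original image.

    judge_score(image, question, response) is supplied by the caller.
    A perceptually grounded judge should score the original image higher,
    so a tie or a higher score on the perturbed image counts as a failure.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    failures = sum(
        judge_score(p.perturbed_image, p.question, p.response)
        >= judge_score(p.original_image, p.question, p.response)
        for p in pairs
    )
    return failures / len(pairs)
```

A judge that looks only at the text gives both images the same score, so every pair counts as a failure and the rate is 1.0. That is the behavior the perturbation design is meant to expose.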

### GRPO Training Framework
The judge is fine-tuned with the Group Relative Policy Optimization (GRPO) algorithm, combined with a batch ranking reward objective. Training supports both full-parameter fine-tuning and LoRA, and is built on the verl project. Checkpoints at several scales have been released (e.g., Qwen3-4B and a LoRA version of Flex-VL-32B).
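
The core of GRPO is that each sampled output is scored relative to the other outputs sampled for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step in plain Python. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, not details confirmed by the project or by verl.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards holds the scalar rewards of all outputs sampled for one prompt.
    eps is an assumed small constant that avoids division by zero when
    every output in the group receives the same reward.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two samples")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]
```

Outputs that beat their group's average get a positive advantage and are reinforced; outputs below it get a negative one. When all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.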

## Experimental Evidence: Performance Improvement of the Perception-Judge Framework

On the MLLM-Judge benchmark, the authors report improvements along three axes:

- **Perceptual Fidelity**: The judge identifies mismatches between image and text more accurately, so biased verdicts occur less often;
- **Ranking Consistency**: The batch ranking reward improves the consistency of rankings across a whole batch;
- **Human Alignment**: Verdicts agree more closely with those of human experts.

The authors take these results as evidence of the framework's effectiveness and generality.
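
One simple way to quantify ranking consistency over a batch is the share of item pairs that the judge orders the same way as a reference ranking. The sketch below is an assumed stand-in in that spirit; it is not the batch ranking reward defined by the authors, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_ranking_agreement(
    predicted: Sequence[float], reference: Sequence[float]
) -> float:
    """Share of item pairs ordered the same way by both score lists.

    Pairs tied in the reference carry no ordering information and are
    skipped. A tie in the prediction on an informative pair counts as a
    disagreement. Returns a value in [0, 1]; 0.0 if no pair is informative.
    """
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(predicted)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue
        total += 1
        if (predicted[i] - predicted[j]) * ref_diff > 0:
            agree += 1
    return agree / total if total else 0.0
```

Because the value depends on the whole batch rather than on any single score, a quantity like this rewards globally coherent rankings. The same function can be pointed at human expert scores as the reference to get a rough human-alignment measure.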

## Technical Implementation and Open-Source Resources

The project is fully open-source and provides:

- **Code Repository**: Scripts for training, data preparation, and evaluation (GRPO training, PPJD construction, MLLM-Judge evaluation);
- **Pre-trained Models**: Checkpoints at multiple scales, released on Hugging Face;
- **Dataset**: PPJD training and validation sets;
- **Project Page**: Visual demos and technical documentation.

The recommended environment is Python 3.10 with CUDA GPUs. Training on 8 GPUs is supported, and a Docker image is provided to avoid dependency issues.

## Research Significance and Future Outlook

- **Theoretical Significance**: According to the authors, this is the first work to systematically define and quantify the perceptual judgment bias of MLLM-as-a-Judge, and it provides both a problem formulation and evaluation benchmarks.
- **Practical Significance**: The released dataset, models, and code form a complete pipeline, which lowers the barrier to further research.
- **Future Outlook**: The approach could prove useful in areas such as multimodal content moderation, generative AI evaluation, and human-machine collaboration systems.

## Conclusion: Academic and Application Value of Perception-Judge

Perception-Judge is a notable step forward for multimodal LLM judges. By combining the PPJD dataset with GRPO and a batch ranking reward, it mitigates perceptual bias and trains judges that are more perceptually grounded, interpretable, and robust. The work has both academic value and a practical path to application, and its open-source resources should help the community build on it.
