Perception-Judge: Eliminating Perceptual Judgment Bias in Multimodal LLMs via Perceptual Perturbation and Reward Modeling

A KAIST research team proposes Perception-Judge, a framework that mitigates the perceptual judgment bias multimodal large language models (MLLMs) show when acting as judges. It combines the Perceptual Perturbation Dataset (PPJD) with GRPO reinforcement learning.

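Tags: Multimodal LLMs, MLLM-as-a-Judge, Perceptual Judgment Bias, GRPO, Reinforcement Learning, PPJD Dataset, ICML 2026, Vision-Language Models, Automated Evaluation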

Published 2026-06-16 17:16 | Recent activity 2026-06-16 17:21 | Estimated read 6 min

Section 01

Introduction: The Perception-Judge Framework Addresses Perceptual Judgment Bias in Multimodal LLM Judges

A KAIST research team proposes the Perception-Judge framework to mitigate the perceptual judgment bias that multimodal large language models (MLLMs) show when acting as judges. The approach has two parts: constructing the Perceptual Perturbation Dataset (PPJD), and training the judge with GRPO reinforcement learning combined with a batch ranking reward. The authors report improvements in perceptual fidelity, ranking consistency, and alignment with human judgments, and they have open-sourced the dataset, models, and code.

Section 02

Research Background: Perceptual Judgment Bias in Multimodal LLM Judges

Multimodal LLMs perform well on tasks such as visual understanding, but they exhibit perceptual judgment bias when used as automated judges: when visual evidence conflicts with textual cues, they tend to reward plausible-sounding text rather than the answer that is correct given what the image actually shows. As a result, evaluations over-rely on textual fluency and neglect the image content. For example, an image description that is fluent but inconsistent with the image can still receive a high score.

Section 03

Solution: PPJD Dataset and GRPO Training Framework

PPJD Dataset

PPJD is built on annotated data from MMPR v1.2. For each sample, it generates variant images that differ only slightly in appearance but differ in key semantics, while the textual response is kept unchanged. Because the text is held fixed, any change in the correct judgment must come from the image, which isolates perceptual errors and provides a supervision signal. The dataset contains approximately 3,000 training samples and has been released on Hugging Face.
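
The idea of holding the text fixed while changing the image can be sketched as follows. This is a minimal illustration, not the released PPJD schema: the `PerturbedPair` fields, the `perceptual_bias_rate` function, and both toy judges are hypothetical names introduced here, and a real judge would be an MLLM rather than a lookup.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PerturbedPair:
    """Hypothetical PPJD-style record: one text response, two images."""
    question: str
    response: str         # identical for both images
    original_image: str   # image the response is consistent with
    perturbed_image: str  # visually similar image that contradicts the response


# A judge maps (image, question, response) to a scalar score.
Judge = Callable[[str, str, str], float]


def perceptual_bias_rate(judge: Judge, pairs: List[PerturbedPair]) -> float:
    """Fraction of pairs where the judge fails to score the perturbed image lower."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    failures = 0
    for pair in pairs:
        score_original = judge(pair.original_image, pair.question, pair.response)
        score_perturbed = judge(pair.perturbed_image, pair.question, pair.response)
        if score_perturbed >= score_original:
            failures += 1
    return failures / len(pairs)


# Toy stand-ins for real MLLM judges (assumptions for illustration only).
MATCHING_IMAGES = {"car_red.png"}


def text_only_judge(image: str, question: str, response: str) -> float:
    """Scores by response length and ignores the image entirely."""
    return float(len(response.split()))


def grounded_judge(image: str, question: str, response: str) -> float:
    """Pretends to check the image by consulting a lookup of matching images."""
    return 1.0 if image in MATCHING_IMAGES else 0.0


pairs = [
    PerturbedPair(
        question="What color is the car?",
        response="The car parked by the curb is red.",
        original_image="car_red.png",
        perturbed_image="car_blue.png",
    )
]
print(perceptual_bias_rate(text_only_judge, pairs))
print(perceptual_bias_rate(grounded_judge, pairs))
```

A judge that ignores the image gives both images the same score, so every pair counts as a failure, whereas a judge that consults the image scores the perturbed variant lower.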

GRPO Training Framework

The judge is fine-tuned with the Group Relative Policy Optimization (GRPO) algorithm, combined with a batch ranking reward objective. Training supports both full-parameter fine-tuning and LoRA, and is built on the verl project. Model checkpoints at several scales have been released (e.g., Qwen3-4B and a LoRA version of Flex-VL-32B).
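
The two ingredients named above can be sketched in a few lines. The group-relative advantage follows the standard GRPO normalization (each sampled output's reward minus the group mean, divided by the group standard deviation). The exact batch ranking reward is not specified in this summary, so `batch_ranking_reward` below is an assumption: it scores a rollout by the fraction of candidate pairs it orders the same way as a reference ranking. Neither function is taken from the project's code or from verl.

```python
import statistics
from itertools import combinations
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def batch_ranking_reward(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Hypothetical reward: share of candidate pairs ordered as in the reference.

    Pairs that are tied in the reference are skipped.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = [
        (i, j)
        for i, j in combinations(range(len(reference)), 2)
        if reference[i] != reference[j]
    ]
    if not pairs:
        return 0.0
    agreeing = sum(
        1
        for i, j in pairs
        if (predicted[i] - predicted[j]) * (reference[i] - reference[j]) > 0
    )
    return agreeing / len(pairs)


# Reference quality scores for three candidate responses in one batch.
reference_scores = [3.0, 1.0, 2.0]

# Three sampled judge rollouts, each scoring the same three candidates.
rollouts = [
    [0.9, 0.1, 0.5],  # same order as the reference
    [0.2, 0.8, 0.5],  # fully reversed
    [0.9, 0.5, 0.1],  # one pair out of order
]

rewards = [batch_ranking_reward(r, reference_scores) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that rank the batch better than their group average get a positive advantage and are reinforced; those below average get a negative one. Because the normalization is relative to the group, no separate value model is needed.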

Section 04

Experimental Evidence: Performance Improvement of the Perception-Judge Framework

On the MLLM-Judge benchmark, the framework is reported to improve three aspects of judging:

Perceptual Fidelity: Identifies mismatches between image and text more accurately, reducing how often the bias occurs;
Ranking Consistency: The batch ranking reward improves the consistency of rankings across a whole batch;
Human Alignment: Judgments agree more closely with those of human experts.

According to the authors, these results support the effectiveness and generality of the framework.
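
One common way to quantify agreement between a judge's scores and human scores is a rank correlation such as Kendall's tau. The sketch below implements the tau-a variant in plain Python purely as an illustration. This summary does not state which metrics the benchmark or the paper use, and the example scores are made up for demonstration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n < 2:
        raise ValueError("need at least two items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up scores for four responses (illustration only).
human_scores = [5, 3, 4, 1]
judge_scores = [0.9, 0.4, 0.3, 0.1]
print(kendall_tau_a(human_scores, judge_scores))
```

A value of 1 means the judge orders every pair as the humans do, and -1 means every pair is reversed.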

Section 05

Technical Implementation and Open-Source Resources

The project is fully open-source and provides:

Code Repository: Training, data preparation, and evaluation scripts (including GRPO training, PPJD construction, MLLM-Judge evaluation);
Pre-trained Models: Multi-scale models released on Hugging Face;
Dataset: PPJD training and validation sets;
Project Page: Visual demos and technical documentation.

The recommended environment is Python 3.10 with CUDA-capable GPUs. Training on 8 GPUs is supported, and a Docker image is provided to simplify dependency setup.

Section 06

Research Significance and Future Outlook

Theoretical Significance: The work is presented as the first to systematically define and quantify the perceptual judgment bias of MLLM-as-a-Judge, providing a problem framing and evaluation benchmarks;
Practical Significance: It provides a complete, open solution (data, training code, and models), which lowers the barrier to further research;
Future Outlook: The approach could benefit areas such as multimodal content moderation, generative AI evaluation, and human-machine collaboration systems.

Section 07

Conclusion: Academic and Application Value of Perception-Judge

Perception-Judge is a notable step forward for multimodal LLM judges. By combining the PPJD dataset with GRPO training and a batch ranking reward, it mitigates perceptual bias and yields judges that are more perceptually grounded, interpretable, and robust. The work has both academic value and a practical path to application, and its open-source resources should help the community build on it.
