Zing Forum

UniRRM: A Unified Reasoning Reward Model Across Languages and Evaluation Paradigms

UniRRM is the first unified reasoning reward model supporting 103 languages and three evaluation paradigms (pairwise, listwise, pointwise), achieving high-quality evaluation through dynamic rubric generation and two-stage training.

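Tags: Reward Model, Multilingual, ICML 2026, LLM Evaluation, GRPO, LLaMA-Factory, Reasoning Model, Pairwise Evaluation, Listwise Evaluation, Pointwise Evaluation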
Published 2026-05-23 23:08 | Recent activity 2026-05-23 23:19 | Estimated read 7 min


Project Background and Motivation

With the rapid development of large language models (LLMs), accurately evaluating the quality of model-generated responses has become a core challenge. Existing reward models typically have the following limitations:

  1. Single-language focus: Most reward models are designed primarily for English, making it difficult to effectively evaluate responses in other languages
  2. Fragmented evaluation paradigms: Pairwise comparison, listwise ranking, and pointwise scoring usually require different models or architectures
  3. Fixed rubrics: Traditional models use predefined rubrics and cannot dynamically adjust based on specific tasks

UniRRM was created to address these issues. The paper, accepted at ICML 2026, proposes the first unified reasoning reward model that supports 103 languages and three evaluation paradigms simultaneously.



1. Adaptive Rubric Generation

UniRRM introduces a phased reasoning chain that dynamically generates task-general and instruction-specific evaluation rubrics. This mechanism enables the model to:

  • Deeply analyze input: Identify potential risks, task types, core requirements, and specific constraints
  • Generate dynamic rubrics: Create 1-5 point scoring rubrics based on specific inputs
  • Fine-grained evaluation: Conduct detailed assessments for each scoring dimension, including evidence extraction, gap analysis, and final scoring
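
The per-dimension assessment described above can be pictured as a small record type. This is only a minimal sketch: the class name, field names, and the unweighted-mean aggregation are hypothetical and are not taken from the UniRRM paper or code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DimensionAssessment:
    """One rubric dimension and its assessment (hypothetical field names)."""
    dimension: str  # rubric dimension generated for this specific input
    evidence: str   # text extracted from the response as support
    gap: str        # what the response lacks relative to the rubric
    score: int      # 1-5 points, as in the rubric scale described above

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")


def overall_score(assessments: List[DimensionAssessment]) -> float:
    """Unweighted mean across dimensions (an assumed aggregation rule)."""
    if not assessments:
        raise ValueError("at least one assessment is required")
    return sum(a.score for a in assessments) / len(assessments)


example = [
    DimensionAssessment("Accuracy", "States the correct date.", "None found.", 5),
    DimensionAssessment("Completeness", "Covers one of two sub-questions.", "Second sub-question unanswered.", 3),
]
final = overall_score(example)
```

How UniRRM actually combines per-dimension scores into a final judgment is not specified in this post, so the mean here is purely illustrative.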

2. Unified Evaluation Pipeline

This is UniRRM's most distinctive design choice. With a single unified architecture, the model can handle:

  • Pairwise evaluation: Compare the quality of two responses
  • Listwise evaluation: Rank multiple responses
  • Pointwise evaluation: Assign an absolute score to a single response

Users can switch evaluation modes simply by adjusting the number of <Response> blocks in the input:

  • 2 blocks → Pairwise evaluation
  • 4 blocks → Listwise evaluation
  • 1 block → Pointwise evaluation
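
A rough sketch of how the block count could select the mode is shown below. The `<Instruction>` wrapper, the exact tag layout, and both function names are assumptions made for illustration. Only the `<Response>` tag and the 1/2/4 counts come from the description above.

```python
from typing import List

# Mapping from the number of <Response> blocks to the paradigm,
# following the counts listed above.
MODE_BY_COUNT = {1: "pointwise", 2: "pairwise", 4: "listwise"}


def build_eval_input(instruction: str, responses: List[str]) -> str:
    """Wrap each candidate in a <Response> block (assumed prompt layout)."""
    if len(responses) not in MODE_BY_COUNT:
        raise ValueError(f"unsupported number of responses: {len(responses)}")
    blocks = [f"<Response>\n{r}\n</Response>" for r in responses]
    # The <Instruction> tag is hypothetical; the real template may differ.
    return "\n\n".join([f"<Instruction>\n{instruction}\n</Instruction>", *blocks])


def infer_mode(prompt: str) -> str:
    """Infer the evaluation paradigm from the number of <Response> blocks."""
    return MODE_BY_COUNT[prompt.count("<Response>")]


prompt = build_eval_input("Translate 'hello' into French.", ["Bonjour", "Salut"])
mode = infer_mode(prompt)  # two blocks, so this is the pairwise case
```

The point of the design is that the caller never picks a mode explicitly: the same model and the same template cover all three paradigms, and only the number of candidates changes.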

3. Multilingual Support

UniRRM is trained on the MixReward dataset, which covers:

  • 103 languages
  • 6 domains

This allows the model to maintain stable evaluation quality across different languages and cultural backgrounds.



Two-Stage Training Pipeline

UniRRM adopts a carefully designed two-stage training strategy:

Stage 1: Supervised Fine-Tuning (SFT)

Full-parameter fine-tuning with the LLaMA-Factory framework builds the model's basic evaluation capabilities. In this stage the model learns how to:

  • Analyze input and identify task types
  • Generate appropriate rubrics
  • Output evaluation results in a structured format

Stage 2: Reinforcement Learning (GRPO)

The verl framework and GRPO (Group Relative Policy Optimization) algorithm are used to further optimize the model's reasoning capabilities. This stage aims to:

  • Improve the accuracy and consistency of evaluation
  • Enhance the model's judgment ability in complex scenarios
  • Optimize generalization performance across languages and paradigms
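
The core idea behind GRPO is that each sampled output is scored relative to the other outputs sampled for the same prompt, rather than against a learned value function. The snippet below is a minimal sketch of that group-relative normalization only. It is not verl's implementation, and the binary reward (1.0 when a sampled judgment matches the reference preference) is an assumption rather than UniRRM's documented reward.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled judgments for one prompt:
# 1.0 if the judgment matched the reference preference, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
```

Judgments that beat the group average receive a positive advantage and the rest a negative one. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.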

Model Performance

UniRRM achieves near state-of-the-art (SOTA) performance on multiple benchmarks:

Pairwise Evaluation Benchmarks:

  • RWBench: 0.907 (8B) / 0.920 (14B)
  • M-RWBench: 0.891 (8B) / 0.910 (14B)
  • MM-Eval: 0.857 (8B) / 0.885 (14B)
  • JudgeBench: 0.683 (8B) / 0.757 (14B)
  • Average Score: 0.834 (8B) / 0.868 (14B)
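
As a quick sanity check, the reported pairwise averages are consistent with an unweighted mean of the four benchmark scores listed above. That the average is computed this way is an assumption, since the post does not say how it was derived.

```python
# Pairwise scores copied from the list above: (8B, 14B).
pairwise = {
    "RWBench": (0.907, 0.920),
    "M-RWBench": (0.891, 0.910),
    "MM-Eval": (0.857, 0.885),
    "JudgeBench": (0.683, 0.757),
}

# Unweighted mean over the four benchmarks for each model size.
avg_8b = sum(scores[0] for scores in pairwise.values()) / len(pairwise)
avg_14b = sum(scores[1] for scores in pairwise.values()) / len(pairwise)

print(f"8B: {avg_8b:.4f}, 14B: {avg_14b:.4f}")
```

The means come out at roughly 0.8345 and 0.868, which agree with the reported 0.834 and 0.868 to within rounding of the underlying scores.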

Listwise Evaluation:

  • RWBench2: 0.753 (8B) / 0.791 (14B)

Pointwise Evaluation (Unseen During Training):

  • Average Score: 0.734 (8B) / 0.772 (14B)

Notably, although pointwise evaluation was never seen during training, UniRRM still generalizes to it.