# UniRRM: A Unified Reasoning Reward Model Across Languages and Evaluation Paradigms

> UniRRM is the first unified reasoning reward model supporting 103 languages and three evaluation paradigms (pairwise, listwise, pointwise), achieving high-quality evaluation through dynamic rubric generation and two-stage training.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-23T15:08:14.000Z
- Last activity: 2026-05-23T15:19:37.953Z
- Popularity: 163.8
- Keywords: reward model, multilingual, ICML 2026, LLM evaluation, GRPO, LLaMA-Factory, reasoning model, pairwise evaluation, listwise evaluation, pointwise evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/unirrm
- Canonical: https://www.zingnex.cn/forum/thread/unirrm
- Markdown source: floors_fallback

---

## Original Authors and Sources

- **Original Author/Maintainer**: Laip11 (SUSTech-NLP Team)
- **Source Platform**: GitHub
- **Original Title**: UniRRM: Unified Reasoning Reward Models Across Languages and Evaluation Paradigms
- **Original Link**: https://github.com/Laip11/UniRRM
- **Paper Link**: https://icml.cc/virtual/2026/poster/61930
- **Publication Time**: ICML 2026 (May 2026)
- **Model Weights**: https://huggingface.co/SUSTech-NLP/UniRRM-8B
- **Dataset**: https://huggingface.co/datasets/SUSTech-NLP/MixReward

---

## Project Background and Motivation

With the rapid development of large language models (LLMs), accurately evaluating the quality of model-generated responses has become a core challenge. Existing reward models typically have the following limitations:

1. **Single-language focus**: Most reward models are designed primarily for English, making it difficult to effectively evaluate responses in other languages
2. **Fragmented evaluation paradigms**: Pairwise comparison, listwise ranking, and pointwise scoring usually require different models or architectures
3. **Fixed rubrics**: Traditional models use predefined rubrics and cannot dynamically adjust based on specific tasks

UniRRM was created to address these issues. The accompanying paper, accepted at ICML 2026, proposes what it describes as the first unified reasoning reward model to support 103 languages and three evaluation paradigms simultaneously.

---

## 1. Adaptive Rubric Generation

UniRRM introduces a phased reasoning chain that dynamically generates task-general and instruction-specific evaluation rubrics. This mechanism enables the model to:

- **Deeply analyze input**: Identify potential risks, task types, core requirements, and specific constraints
- **Generate dynamic rubrics**: Create 1-5 point scoring rubrics based on specific inputs
- **Fine-grained evaluation**: Conduct detailed assessments for each scoring dimension, including evidence extraction, gap analysis, and final scoring
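
To make the three steps above concrete, the following is a minimal sketch of how a dynamically generated rubric and its per-dimension assessment could be represented. The class names, field names, and the unweighted-mean aggregation are illustrative assumptions; the source does not specify UniRRM's actual output schema or how dimension scores are combined.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionAssessment:
    """One rubric dimension and its assessment (hypothetical schema)."""
    name: str            # e.g. "Faithfulness"
    description: str     # what a strong response looks like on this dimension
    evidence: str = ""   # text extracted from the response
    gap: str = ""        # what the response is missing
    score: int = 0       # 1-5 once assessed; 0 means "not assessed yet"

    def assess(self, evidence: str, gap: str, score: int) -> None:
        # The source describes 1-5 point rubrics, so reject anything else.
        if not 1 <= score <= 5:
            raise ValueError(f"score must be in 1..5, got {score}")
        self.evidence, self.gap, self.score = evidence, gap, score


@dataclass
class Evaluation:
    """Input analysis plus the rubric generated for this specific input."""
    task_type: str
    constraints: list[str]
    criteria: list[CriterionAssessment] = field(default_factory=list)

    def final_score(self) -> float:
        # Assumption: unweighted mean over assessed dimensions.
        scored = [c.score for c in self.criteria if c.score]
        if not scored:
            raise ValueError("no criterion has been assessed")
        return sum(scored) / len(scored)


ev = Evaluation(task_type="summarization", constraints=["max 50 words"])
ev.criteria.append(CriterionAssessment("Faithfulness", "No unsupported claims"))
ev.criteria.append(CriterionAssessment("Constraint adherence", "Stays within 50 words"))
ev.criteria[0].assess("All claims appear in the source", "None", 4)
ev.criteria[1].assess("Response has 80 words", "Exceeds the limit", 2)
print(ev.final_score())  # 3.0
```

Keeping evidence and gap next to each score mirrors the evidence extraction, gap analysis, and final scoring sequence, so a reader can trace every number back to the text that justified it.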

## 2. Unified Evaluation Pipeline

The central design choice in UniRRM is a single architecture that handles all three evaluation paradigms:

- **Pairwise evaluation**: Compare the quality of two responses
- **Listwise evaluation**: Rank multiple responses
- **Pointwise evaluation**: Assign an absolute score to a single response

Users can switch evaluation modes simply by adjusting the number of `<Response>` blocks in the input:

- 1 block → Pointwise evaluation
- 2 blocks → Pairwise evaluation
- 4 blocks → Listwise evaluation
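
The block-count convention can be sketched as a small helper that builds an input and infers the paradigm from it. This is only an illustration: the closing `</Response>` tag, the `<Instruction>` wrapper, and both function names are assumptions rather than the project's documented prompt template, and treating every count of three or more as listwise generalizes the four-block case described above.

```python
import re

# Assumed tag syntax; the source only states that inputs contain `<Response>` blocks.
RESPONSE_BLOCK = re.compile(r"<Response>(.*?)</Response>", re.DOTALL)


def build_prompt(instruction: str, responses: list[str]) -> str:
    """Wrap the instruction and each candidate response in tags (hypothetical layout)."""
    blocks = "\n".join(f"<Response>\n{r}\n</Response>" for r in responses)
    return f"<Instruction>\n{instruction}\n</Instruction>\n{blocks}"


def infer_paradigm(prompt: str) -> str:
    """Map the number of response blocks to an evaluation paradigm."""
    n = len(RESPONSE_BLOCK.findall(prompt))
    if n == 0:
        raise ValueError("prompt contains no <Response> block")
    if n == 1:
        return "pointwise"
    if n == 2:
        return "pairwise"
    return "listwise"  # assumption: 3+ blocks are ranked as a list


prompt = build_prompt("Translate 'hello' into French.", ["Bonjour", "Salut"])
print(infer_paradigm(prompt))  # pairwise
```

For the exact template expected by the released weights, consult the repository linked under Original Authors and Sources.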

## 3. Multilingual Support

UniRRM is trained on the **MixReward** dataset, which covers:
- **103 languages**
- **6 domains**

This allows the model to maintain stable evaluation quality across different languages and cultural backgrounds.

---

## Two-Stage Training Pipeline

UniRRM adopts a carefully designed two-stage training strategy:

**Stage 1: Supervised Fine-Tuning (SFT)**

Full-parameter fine-tuning with the LLaMA-Factory framework builds the model's basic evaluation capabilities. In this stage the model learns to:
- Analyze input and identify task types
- Generate appropriate rubrics
- Output evaluation results in a structured format

**Stage 2: Reinforcement Learning (GRPO)**

The verl framework and the GRPO (Group Relative Policy Optimization) algorithm are then used to further optimize the model's reasoning capabilities. This stage aims to:
- Improve the accuracy and consistency of evaluation
- Enhance the model's judgment ability in complex scenarios
- Optimize generalization performance across languages and paradigms
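
The core idea behind GRPO is that each sampled output is scored relative to the other outputs sampled for the same prompt, which removes the need for a separate value model. The sketch below shows only that group-relative advantage step, using the standard library. The 0/1 correctness reward in the usage example (1 if a sampled verdict matches the reference preference) is a hypothetical choice, and the use of the population standard deviation is an assumption; the source does not describe UniRRM's reward function or verl configuration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group.

    `rewards` holds one scalar per sampled output for a single prompt.
    `eps` avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled evaluations of one prompt; reward 1.0 means the verdict was correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Outputs that beat their group's average receive a positive advantage and are reinforced, while a group with identical rewards yields zero advantage and therefore no learning signal for that prompt.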

## Model Performance

UniRRM reports near-state-of-the-art (SOTA) performance on multiple benchmarks:

**Pairwise Evaluation Benchmarks**:
- RWBench: 0.907 (8B) / 0.920 (14B)
- M-RWBench: 0.891 (8B) / 0.910 (14B)
- MM-Eval: 0.857 (8B) / 0.885 (14B)
- JudgeBench: 0.683 (8B) / 0.757 (14B)
- **Average Score**: 0.834 (8B) / 0.868 (14B)
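
As a quick consistency check, the reported averages can be recomputed from the four pairwise scores listed above. The sketch assumes the average is an unweighted mean over the four benchmarks, which the source does not state explicitly; the function name is illustrative.

```python
from statistics import mean

# Scores copied from the list above as (8B, 14B) pairs.
PAIRWISE_SCORES = {
    "RWBench": (0.907, 0.920),
    "M-RWBench": (0.891, 0.910),
    "MM-Eval": (0.857, 0.885),
    "JudgeBench": (0.683, 0.757),
}
REPORTED_AVERAGE = {"8B": 0.834, "14B": 0.868}


def recompute_averages() -> dict[str, float]:
    """Return the unweighted mean per model size (assumed aggregation)."""
    return {
        size: mean(scores[idx] for scores in PAIRWISE_SCORES.values())
        for idx, size in enumerate(("8B", "14B"))
    }


for size, value in recompute_averages().items():
    diff = abs(value - REPORTED_AVERAGE[size])
    print(f"{size}: recomputed {value:.4f}, reported {REPORTED_AVERAGE[size]:.3f}, diff {diff:.4f}")
```

Under that assumption, both recomputed means agree with the reported averages to within rounding of the per-benchmark scores.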

**Listwise Evaluation**:
- RWBench2: 0.753 (8B) / 0.791 (14B)

**Pointwise Evaluation (Unseen During Training)**:
- Average Score: 0.734 (8B) / 0.772 (14B)

Notably, even without dedicated optimization for pointwise evaluation during training, UniRRM still demonstrates good generalization ability.

---