UniRRM: A Unified Reasoning Reward Model Across Languages and Evaluation Paradigms

UniRRM is presented as the first unified reasoning reward model to support 103 languages and three evaluation paradigms (pairwise, listwise, and pointwise). It relies on dynamic rubric generation and a two-stage training pipeline.

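Tags: reward model, multilingual, ICML 2026, LLM evaluation, GRPO, LLaMA-Factory, reasoning model, pairwise evaluation, listwise evaluation, pointwise evaluation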

Published 2026-05-23 23:08 | Recent activity 2026-05-23 23:19 | Estimated read 7 min

Section 02

Original Authors and Sources

Original Author/Maintainer: Laip11 (SUSTech-NLP Team)
Source Platform: GitHub
Original Title: UniRRM: Unified Reasoning Reward Models Across Languages and Evaluation Paradigms
Original Link: https://github.com/Laip11/UniRRM
Paper Link: https://icml.cc/virtual/2026/poster/61930
Publication Time: ICML 2026 (May 2026)
Model Weights: https://huggingface.co/SUSTech-NLP/UniRRM-8B
Dataset: https://huggingface.co/datasets/SUSTech-NLP/MixReward

Section 03

Project Background and Motivation

With the rapid development of large language models (LLMs), accurately evaluating the quality of model-generated responses has become a core challenge. Existing reward models typically have the following limitations:

Single-language focus: Most reward models are designed primarily for English, making it difficult to effectively evaluate responses in other languages
Fragmented evaluation paradigms: Pairwise comparison, listwise ranking, and pointwise scoring usually require different models or architectures
Fixed rubrics: Traditional models use predefined rubrics and cannot dynamically adjust based on specific tasks

UniRRM was created to address these issues. The accompanying paper, accepted at ICML 2026, proposes what it describes as the first unified reasoning reward model to support 103 languages and all three evaluation paradigms at once.

Section 04

1. Adaptive Rubric Generation

UniRRM introduces a phased reasoning chain that dynamically generates task-general and instruction-specific evaluation rubrics. This mechanism enables the model to:

Deeply analyze input: Identify potential risks, task types, core requirements, and specific constraints
Generate dynamic rubrics: Create 1-5 point scoring rubrics based on specific inputs
Evaluate at fine granularity: Assess each scoring dimension in detail, including evidence extraction, gap analysis, and final scoring
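The three steps above can be pictured as a small data structure. The following is a minimal sketch, not UniRRM's actual output schema: the class names, field names, and the unweighted-mean aggregation are all assumptions made for illustration. Only the 1-5 scale and the evidence / gap analysis / score breakdown come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CriterionAssessment:
    """One rubric dimension (hypothetical structure, not the model's real schema)."""
    name: str
    evidence: str      # text extracted from the response that supports the score
    gap_analysis: str  # what the response is missing relative to the criterion
    score: int         # integer on the 1-5 scale

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be in 1..5, got {self.score}")


@dataclass
class RubricEvaluation:
    """Input analysis plus per-criterion assessments for a single response."""
    task_type: str
    constraints: List[str]
    assessments: List[CriterionAssessment] = field(default_factory=list)

    def final_score(self) -> float:
        # Assumption: unweighted mean. The source does not say how scores are combined.
        if not self.assessments:
            raise ValueError("no criteria assessed")
        return sum(a.score for a in self.assessments) / len(self.assessments)


evaluation = RubricEvaluation(
    task_type="summarization",
    constraints=["at most 50 words"],
    assessments=[
        CriterionAssessment("Faithfulness", "Quotes the key finding.", "Omits the caveat.", 4),
        CriterionAssessment("Constraint adherence", "Runs to 80 words.", "Exceeds the limit.", 2),
    ],
)
print(evaluation.final_score())  # 3.0
```

Keeping evidence and gap analysis next to each score mirrors the idea that the final number should be traceable to a specific observation about the response.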

Section 05

2. Unified Evaluation Pipeline

This is UniRRM's central design choice. A single architecture handles all three paradigms:

Pairwise evaluation: Compare the quality of two responses
Listwise evaluation: Rank multiple responses
Pointwise evaluation: Assign an absolute score to a single response

Users can switch evaluation modes simply by adjusting the number of <Response> blocks in the input:

2 blocks → Pairwise evaluation
4 blocks → Listwise evaluation
1 block → Pointwise evaluation
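A rough sketch of how a caller might assemble such an input is shown below. The closing `</Response>` tag, the `<Instruction>` wrapper, and both function names are assumptions made for illustration, so consult the repository for the exact prompt template. Treating any count of three or more as listwise is also an assumption, since the mapping above only lists the 4-block case.

```python
from typing import List


def infer_paradigm(num_responses: int) -> str:
    """Map the number of <Response> blocks to an evaluation paradigm."""
    if num_responses == 1:
        return "pointwise"
    if num_responses == 2:
        return "pairwise"
    if num_responses >= 3:
        # Assumption: the source lists only 4 blocks for listwise evaluation.
        return "listwise"
    raise ValueError("at least one response is required")


def build_prompt(instruction: str, responses: List[str]) -> str:
    """Wrap each candidate in its own <Response> block (hypothetical template)."""
    blocks = [f"<Response>\n{text}\n</Response>" for text in responses]
    header = f"<Instruction>\n{instruction}\n</Instruction>"
    return "\n\n".join([header, *blocks])


candidates = ["Bonjour", "Salut"]
prompt = build_prompt("Translate 'hello' to French.", candidates)
print(infer_paradigm(len(candidates)))  # pairwise
```

Because the paradigm follows from the block count alone, the same calling code can serve all three modes without a separate mode flag.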

Section 06

3. Multilingual Support

UniRRM is trained on the MixReward dataset, which covers:

103 languages
6 domains

This allows the model to maintain stable evaluation quality across different languages and cultural backgrounds.

Section 07

Two-Stage Training Pipeline

UniRRM adopts a carefully designed two-stage training strategy:

Stage 1: Supervised Fine-Tuning (SFT)

Full fine-tuning with the LLaMA-Factory framework builds the model's basic evaluation capabilities. In this stage the model learns to:

Analyze input and identify task types
Generate appropriate rubrics
Output evaluation results in a structured format

Stage 2: Reinforcement Learning (GRPO)

The verl framework and GRPO (Group Relative Policy Optimization) algorithm are used to further optimize the model's reasoning capabilities. This stage aims to:

Improve the accuracy and consistency of evaluation
Enhance the model's judgment ability in complex scenarios
Optimize generalization performance across languages and paradigms
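For readers unfamiliar with GRPO, its core idea is to sample a group of outputs for the same prompt and score each one relative to the group rather than against a learned value function. The sketch below shows only that normalization step in plain Python. It is not the verl implementation, and the 0/1 correctness reward is a hypothetical stand-in, since the source does not specify UniRRM's reward function.

```python
from statistics import mean, pstdev
from typing import List


def pairwise_reward(predicted: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the judged winner matches the label, else 0.0."""
    return 1.0 if predicted == gold else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group.

    Population std is used here; implementations may use the sample std instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgments for one prompt whose labeled winner is "A".
sampled = ["A", "B", "B", "A"]
rewards = [pairwise_reward(s, "A") for s in sampled]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every sample in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.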

Section 08

Model Performance

UniRRM reports near-state-of-the-art (SOTA) results on multiple benchmarks:

Pairwise Evaluation Benchmarks:

RWBench: 0.907 (8B) / 0.920 (14B)
M-RWBench: 0.891 (8B) / 0.910 (14B)
MM-Eval: 0.857 (8B) / 0.885 (14B)
JudgeBench: 0.683 (8B) / 0.757 (14B)
Average Score: 0.834 (8B) / 0.868 (14B)
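The reported averages appear consistent with an unweighted mean over the four pairwise benchmarks. The short check below recomputes them from the figures listed above. The unweighted-mean reading is an inference from the arithmetic, not something the source states.

```python
from statistics import mean

# (8B, 14B) scores copied from the list above.
pairwise_scores = {
    "RWBench": (0.907, 0.920),
    "M-RWBench": (0.891, 0.910),
    "MM-Eval": (0.857, 0.885),
    "JudgeBench": (0.683, 0.757),
}

avg_8b = mean(scores[0] for scores in pairwise_scores.values())
avg_14b = mean(scores[1] for scores in pairwise_scores.values())
print(f"{avg_8b:.4f} {avg_14b:.4f}")  # 0.8345 0.8680
```

Both values agree with the listed averages of 0.834 and 0.868 to within rounding.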

Listwise Evaluation:

RWBench2: 0.753 (8B) / 0.791 (14B)

Pointwise Evaluation (Unseen During Training):

Average Score: 0.734 (8B) / 0.772 (14B)

Notably, although pointwise evaluation was not seen during training, UniRRM still reaches average scores of 0.734 (8B) and 0.772 (14B), which suggests that the unified format generalizes to this paradigm.
