UniRRM is trained with a two-stage strategy:
Stage 1: Supervised Fine-Tuning (SFT)
Full-parameter fine-tuning with the LLaMA-Factory framework establishes the model's baseline evaluation capabilities. In this stage the model learns to:
- Analyze input and identify task types
- Generate appropriate rubrics
- Output evaluation results in a structured format
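To make "structured format" concrete, the sketch below parses one evaluation record into typed fields. The JSON layout and the field names (`task_type`, `rubrics`, `verdict`) are hypothetical, chosen only for illustration; they are not UniRRM's actual output schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Evaluation:
    # All field names here are illustrative assumptions, not the real schema.
    task_type: str       # task type identified from the input
    rubrics: List[str]   # rubrics generated for this input
    verdict: str         # final evaluation result


def parse_evaluation(raw: str) -> Evaluation:
    """Parse a JSON-encoded evaluation; raises KeyError if a field is missing."""
    data = json.loads(raw)
    return Evaluation(
        task_type=data["task_type"],
        rubrics=list(data["rubrics"]),
        verdict=data["verdict"],
    )


# Hypothetical model output for a pairwise comparison.
raw = '{"task_type": "pairwise", "rubrics": ["factual accuracy", "clarity"], "verdict": "A"}'
result = parse_evaluation(raw)
```

A fixed schema of this kind makes outputs easy to validate automatically, which matters once the evaluation result is used to compute a reward.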
Stage 2: Reinforcement Learning (GRPO)
The model is then trained further with the GRPO (Group Relative Policy Optimization) algorithm, using the verl framework, to strengthen its reasoning capabilities. This stage aims to:
- Improve the accuracy and consistency of evaluation
- Enhance the model's judgment ability in complex scenarios
- Optimize generalization performance across languages and paradigms
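The core idea of GRPO is that several responses are sampled for the same prompt and each response's reward is normalized against the others in its group, so no separate value model is needed. The sketch below shows only this group-relative advantage step. It is not verl's implementation; the reward values, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    # eps avoids division by zero when every response receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 if a sampled evaluation matched the reference
# judgment, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. When all responses in a group receive the same reward, every advantage is zero and that group contributes no learning signal.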