Zing Forum

StrataRL: A Multi-Domain Reasoning Reinforcement Learning Framework for Small Language Models

This article introduces StrataRL, a framework that addresses cross-domain catastrophic forgetting in GRPO training. It combines hierarchical advantage normalization with structured template rewards so that a small language model can improve on mathematical, commonsense, and strategic reasoning tasks at the same time.

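Tags: GRPO, reinforcement learning, small language models, multi-domain reasoning, advantage normalization, structured rewards, model training, machine learning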
Published 2026-06-04 19:55 | Recent activity 2026-06-04 20:21 | Estimated read 6 min

Section 01

StrataRL Framework Overview: Addressing Cross-Domain Forgetting in Multi-Domain Reasoning for Small Models

StrataRL is a multi-domain reasoning reinforcement learning framework for small language models. It targets cross-domain catastrophic forgetting in GRPO training with two mechanisms: hierarchical advantage normalization (SAN) and structured template rewards (ST-GRPO). Together they let the model improve on mathematical, commonsense, and strategic reasoning tasks simultaneously, instead of gaining in one domain at the expense of another as in standard mixed-domain training.

Section 02

Research Background: Cross-Domain Catastrophic Forgetting in GRPO Training

Group Relative Policy Optimization (GRPO) is a mainstream method for training the reasoning capabilities of large language models. However, standard GRPO suffers from cross-domain catastrophic forgetting when several domains are trained together: as the model improves in one domain (e.g., mathematical reasoning), its performance in another (e.g., commonsense QA) declines. The cause is global advantage normalization, which compares rewards from easy domains (high rewards) and difficult domains (low rewards) in a single pool. Even the best trajectories from a difficult domain can then fall below the pooled mean and be suppressed. StrataRL is designed to address this problem.
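
The suppression effect can be seen in a small numeric sketch. The reward values below are made up for illustration and are not taken from the StrataRL experiments; `z_normalize` is a hypothetical helper, not part of the framework.

```python
import statistics

def z_normalize(rewards):
    """Center rewards and divide by their population standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Hypothetical rewards: an easy domain scoring high, a hard domain scoring low.
easy = [0.9, 1.0, 0.8, 1.0]
hard = [0.0, 0.1, 0.3, 0.0]  # 0.3 is the best trajectory in the hard domain

# Global normalization: both domains share one mean and one std.
global_adv = z_normalize(easy + hard)
hard_global = global_adv[len(easy):]

# Per-domain normalization: the hard domain is compared only with itself.
hard_local = z_normalize(hard)

print("hard domain, global:    ", [round(a, 2) for a in hard_global])
print("hard domain, per-domain:", [round(a, 2) for a in hard_local])
```

Under pooled normalization every hard-domain trajectory, including the best one, receives a negative advantage because the pooled mean is dominated by the easy domain. Normalized within its own domain, the best hard-domain trajectory gets a positive advantage.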

Section 03

Core Innovations: Hierarchical Advantage Normalization and Structured Template Rewards

Hierarchical Advantage Normalization (SAN)

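Rewards are normalized within each domain rather than across the whole batch. The normalization strategy is chosen dynamically from the reward variance of each domain's samples in the batch: with zero variance the rewards are only centered, with low variance they are centered and scaled with damping, and with normal variance they are Z-normalized. This avoids the gradient bias that arises when domains with different reward levels are compared directly.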
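
A minimal sketch of this per-domain selection logic follows. The article does not give the variance thresholds or the damping formula, so `LOW_STD`, `DAMPING`, the use of the standard deviation as the variance measure, and both function names are assumptions chosen for illustration.

```python
import statistics
from collections import defaultdict

EPS = 1e-8
LOW_STD = 0.05  # assumed threshold between "low" and "normal" variance
DAMPING = 0.5   # assumed constant added to the denominator for damped scaling

def normalize_domain(rewards):
    """Normalize one domain's rewards, choosing a strategy from their spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    centered = [r - mean for r in rewards]
    if std < EPS:
        return centered                                  # zero variance: center only
    if std < LOW_STD:
        return [c / (std + DAMPING) for c in centered]   # low variance: damped scaling
    return [c / std for c in centered]                   # normal variance: z-normalization

def stratified_advantages(rewards, domains):
    """Compute advantages by normalizing each domain's rewards separately."""
    by_domain = defaultdict(list)
    for i, d in enumerate(domains):
        by_domain[d].append(i)
    adv = [0.0] * len(rewards)
    for idxs in by_domain.values():
        normed = normalize_domain([rewards[i] for i in idxs])
        for i, a in zip(idxs, normed):
            adv[i] = a
    return adv

# Hypothetical batch: one domain per variance regime.
rewards = [1.0, 1.0, 0.0, 0.3, 0.50, 0.52]
domains = ["strategy", "strategy", "math", "math", "qa", "qa"]
adv = stratified_advantages(rewards, domains)
print([round(a, 3) for a in adv])
```

The damped branch keeps near-identical rewards from being blown up into large advantages, which is one plausible reading of "damped scaling"; the actual implementation may differ.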

Structured Template Reward (ST-GRPO)

Each domain defines its own reasoning template (for example, math outputs must include tags such as <decompose>). The output structure is verified with regular expressions, which removes the need for an external reward model and provides a cheap, deterministic proxy signal for reasoning quality.
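
A regex-based structure check of this kind might look like the sketch below. Only the `<decompose>` tag is mentioned in the article; the other tag names, the `TEMPLATES` table, and the binary 1.0/0.0 scoring are hypothetical.

```python
import re

# Only <decompose> comes from the article; the other tags are hypothetical.
TEMPLATES = {
    "math": ["decompose", "solve", "answer"],
    "commonsense": ["recall", "answer"],
}

def build_pattern(tags):
    """Compile a regex that expects each tagged, non-empty section in order."""
    parts = [rf"<{t}>\s*\S.*?</{t}>" for t in tags]
    return re.compile(r"\s*" + r"\s*".join(parts) + r"\s*", re.DOTALL)

PATTERNS = {domain: build_pattern(tags) for domain, tags in TEMPLATES.items()}

def structure_reward(domain, completion):
    """Return 1.0 if the whole completion follows the domain template, else 0.0."""
    return 1.0 if PATTERNS[domain].fullmatch(completion) else 0.0

good = (
    "<decompose>Two groups of 2 apples.</decompose>\n"
    "<solve>2 + 2 = 4</solve>\n"
    "<answer>4</answer>"
)
print(structure_reward("math", good))                # follows the template
print(structure_reward("math", "The answer is 4."))  # no tags at all
```

This sketch checks only that the sections are present, non-empty, and in order. It does not check that the content inside each tag is correct, which is the job of the separate result reward.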

Section 04

Training Architecture: Adaptive Sampling and Composite Reward Design

Key stages of the training pipeline:

  1. UCB Curriculum Sampler: schedules domains adaptively, prioritizing those where the model currently performs poorly;
  2. Rollout Engine: supports a Hugging Face backend (local M4) and a vLLM backend (GPU environment);
  3. Composite Reward: combines a result reward (numeric, letter-choice, or yes/no answer verification), a structure reward (template check), and a repetition penalty;
  4. GRPO Loss: trained efficiently with QLoRA; omitting the frozen reference model saves memory, while log-ratio clipping and precise KL alignment keep training stable.
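
The curriculum sampler in step 1 can be sketched as a standard UCB1 bandit over domains. The article does not describe the exact scoring rule, so treating "need" as one minus the running mean reward, the exploration constant `c`, and the class name `UCBDomainSampler` are all assumptions. The per-domain rewards in the usage loop are illustrative placeholders, not measured results.

```python
import math

class UCBDomainSampler:
    """UCB1-style scheduler that favors domains with low average reward."""

    def __init__(self, domains, c=1.0):
        self.domains = list(domains)
        self.c = c  # assumed exploration constant
        self.counts = {d: 0 for d in self.domains}
        self.mean_reward = {d: 0.0 for d in self.domains}
        self.total = 0

    def select(self):
        # Visit every domain once before applying the UCB rule.
        for d in self.domains:
            if self.counts[d] == 0:
                return d

        def score(d):
            need = 1.0 - self.mean_reward[d]  # low reward means high priority
            bonus = self.c * math.sqrt(math.log(self.total) / self.counts[d])
            return need + bonus

        return max(self.domains, key=score)

    def update(self, domain, batch_reward):
        # Incremental running mean of rewards assumed to lie in [0, 1].
        self.counts[domain] += 1
        self.total += 1
        n = self.counts[domain]
        self.mean_reward[domain] += (batch_reward - self.mean_reward[domain]) / n

# Illustrative fixed rewards per domain; a real run would use rollout rewards.
fake_reward = {"gsm8k": 0.5, "mmlu": 0.3, "strategyqa": 0.9}
sampler = UCBDomainSampler(fake_reward)
for _ in range(300):
    domain = sampler.select()
    sampler.update(domain, fake_reward[domain])
print(sampler.counts)
```

In this toy run the weakest domain is sampled most often, while the exploration bonus ensures the strongest domain is still visited occasionally rather than dropped.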

Section 05

Experimental Results: Simultaneous Improvement in Multi-Domain Reasoning Capabilities

Baselines were measured strictly with the training prompt template (GSM8K: 0.500, MMLU: 0.300). After training with StrataRL, the Qwen2.5-3B-Instruct model achieved:

  • GSM8K mathematical reasoning improved by about 10 percentage points, to over 0.600;
  • MMLU commonsense QA improved by about 10 percentage points, to over 0.400;
  • StrategyQA strategic reasoning improved by about 5 percentage points, to over 0.950.

All domains improved simultaneously, with no cross-domain forgetting.

Section 06

Ablation Experiments: Verification of Component Necessity

Key findings from ablation experiments:

  • Removing SAN significantly reduces training stability in difficult domains;
  • Result-only rewards perform poorly in domains that require multi-step reasoning;
  • Inaccurate old-policy probabilities cause KL drift and training instability;
  • A fixed noise intensity causes drift over the course of training, which an annealing schedule effectively mitigates.

Section 07

Limitations and Future Improvement Directions

Limitations

  • Computational resource requirements are high; a local M4 machine supports only small batch sizes;
  • Templates are highly domain-specific, so extending to a new domain requires manual template design;
  • Sparse rewards in some domains slow convergence.

Future Directions

  • Develop a general method for generating structural reward templates;
  • Explore adaptive domain weight adjustment strategies;
  • Expand to larger models (7B, 13B).