# StrataRL: A Multi-Domain Reasoning Reinforcement Learning Framework for Small Language Models

> This article introduces the StrataRL framework, which addresses the cross-domain catastrophic forgetting problem in GRPO training through hierarchical advantage normalization and structured template reward mechanisms, enabling small language models to achieve simultaneous improvements in mathematical, commonsense, and strategic reasoning tasks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-04T11:55:24.000Z
- Last activity: 2026-06-04T12:21:22.430Z
- Popularity: 150.6
- Keywords: GRPO, reinforcement learning, small language models, multi-domain reasoning, advantage normalization, structured rewards, model training, machine learning
- Page link: https://www.zingnex.cn/en/forum/thread/stratarl
- Canonical: https://www.zingnex.cn/forum/thread/stratarl
- Markdown source: floors_fallback

---

## StrataRL Framework Overview: Addressing Cross-Domain Forgetting in Multi-Domain Reasoning for Small Models

StrataRL is a reinforcement learning framework for multi-domain reasoning in small language models. It targets cross-domain catastrophic forgetting in GRPO training with two mechanisms, hierarchical advantage normalization (SAN) and structured template rewards (ST-GRPO), so that mathematical, commonsense, and strategic reasoning improve together instead of trading off against one another as in standard mixed-domain training.

## Research Background: Cross-Domain Catastrophic Forgetting in GRPO Training

Group Relative Policy Optimization (GRPO) is a mainstream method for training the reasoning capabilities of large language models. In mixed multi-domain training, however, standard GRPO suffers from cross-domain catastrophic forgetting: as the model improves in one domain (e.g., mathematical reasoning), its performance in another (e.g., commonsense QA) declines. The cause is global advantage normalization: rewards from easy domains (high rewards) and difficult domains (low rewards) are normalized together, so the better trajectories in a difficult domain can still fall below the shared mean and be suppressed. StrataRL is designed to address this problem.
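The suppression effect can be seen in a toy example. The sketch below uses invented reward values (not taken from the StrataRL experiments) and a plain z-score helper, `z_normalize`, which is a name introduced here for illustration. It compares the advantage of the best trajectory in a hard domain under global normalization and under normalization restricted to that domain.

```python
from statistics import mean, pstdev

def z_normalize(rewards, eps=1e-6):
    """Plain z-score normalization of a list of rewards."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards: an easy domain scores high, a hard domain scores low.
easy = [0.9, 1.0, 0.8, 1.0]
hard = [0.0, 0.3, 0.0, 0.1]

# Global normalization: both domains share one mean and one standard deviation.
global_adv = z_normalize(easy + hard)
hard_global = global_adv[len(easy):]

# Per-domain normalization: the hard domain is compared only with itself.
hard_local = z_normalize(hard)

best = hard.index(max(hard))  # the best trajectory in the hard domain
print("global advantage of best hard trajectory:", hard_global[best])
print("per-domain advantage of best hard trajectory:", hard_local[best])
```

Under the shared mean, every hard-domain trajectory in this example gets a negative advantage, including the best one, while per-domain normalization gives that same trajectory a positive advantage.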

## Core Innovations: Hierarchical Advantage Normalization and Structured Template Rewards

### Hierarchical Advantage Normalization (SAN)
Rewards are normalized within each domain rather than across the whole batch. The normalization strategy is selected dynamically based on the batch reward variance: with zero variance, rewards are only centered; with low variance, damped scaling is applied; with normal variance, standard z-score normalization is used. This avoids cross-domain gradient bias.
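The three variance regimes described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the function name, the thresholds `zero_std` and `low_std`, and the choice to implement "damped scaling" as division by a fixed floor are all assumptions, since the source does not give these details.

```python
from collections import defaultdict
from statistics import mean, pstdev

def domain_normalized_advantages(rewards, domains, zero_std=1e-8, low_std=0.05):
    """Normalize rewards within each domain; thresholds are illustrative."""
    by_domain = defaultdict(list)
    for i, d in enumerate(domains):
        by_domain[d].append(i)

    advantages = [0.0] * len(rewards)
    for idxs in by_domain.values():
        vals = [rewards[i] for i in idxs]
        mu, sigma = mean(vals), pstdev(vals)
        if sigma < zero_std:
            # Zero variance: only center (every advantage becomes 0).
            scale = 1.0
        elif sigma < low_std:
            # Low variance: damped scaling (assumed here to mean dividing by
            # a fixed floor so tiny reward gaps are not blown up).
            scale = 1.0 / low_std
        else:
            # Normal variance: standard z-score normalization.
            scale = 1.0 / sigma
        for i in idxs:
            advantages[i] = (rewards[i] - mu) * scale
    return advantages
```

For example, calling `domain_normalized_advantages([1.0, 1.0, 0.0, 0.3], ["easy", "easy", "hard", "hard"])` leaves the zero-variance `easy` group at zero advantage and z-normalizes the `hard` group on its own statistics.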
### Structured Template Reward (ST-GRPO)
Each domain defines its own reasoning template (e.g., math requires tags such as `<decompose>`). The output structure is verified with regular expressions, which removes the need for an external reward model and provides a reliable signal of reasoning quality.
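A regex-based structure check of this kind might look like the sketch below. Only the `<decompose>` tag comes from the source; the other tag names, the `TEMPLATES` table, and the binary 0/1 scoring are hypothetical choices made for illustration.

```python
import re

# Only <decompose> is named in the source; all other tags are hypothetical.
TEMPLATES = {
    "math": ["decompose", "solve", "answer"],
    "commonsense": ["recall", "answer"],
}

def structure_reward(text, domain):
    """Return 1.0 if the output contains the domain's tags, in order,
    each with non-empty content; otherwise return 0.0."""
    tags = TEMPLATES[domain]
    # One <tag>...</tag> block per required tag, separated by optional whitespace.
    body = r"\s*".join(rf"<{t}>\s*\S.*?</{t}>" for t in tags)
    pattern = rf"\s*{body}\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0

good = (
    "<decompose>Split 12*3 into 10*3 + 2*3.</decompose>\n"
    "<solve>30 + 6 = 36</solve>\n"
    "<answer>36</answer>"
)
missing_step = "<decompose>Split the product.</decompose><answer>36</answer>"
print(structure_reward(good, "math"), structure_reward(missing_step, "math"))
```

Because the check is purely syntactic, it verifies that the reasoning follows the template, not that each step is correct; correctness is left to the result reward.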

## Training Architecture: Adaptive Sampling and Composite Reward Design

Key components of the training pipeline:
1. **UCB Curriculum Sampler**: Adaptive domain scheduling that prioritizes domains where the model performs poorly;
2. **Rollout Engine**: Supports Hugging Face (local M4) and vLLM (GPU environment) backends;
3. **Composite Reward**: Result reward (numeric/alphabetic/yes-no verification), structure reward (template check), and a repetition penalty;
4. **GRPO Loss**: Trained efficiently with QLoRA; omitting the frozen reference model saves memory, while log-ratio clipping and precise KL alignment keep training stable.
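The curriculum sampler in step 1 can be sketched with a standard UCB1-style rule. The source does not specify the exact scoring, so the class below is an assumption: it treats `1 - mean_reward` as the "need" of a domain and adds the usual exploration bonus. The class name, the exploration constant `c`, and the fixed per-domain rewards in the demo loop are all illustrative.

```python
import math

class UCBDomainSampler:
    """UCB1-style scheduler that favors domains with low average reward."""

    def __init__(self, domains, c=1.0):
        self.domains = list(domains)
        self.c = c  # exploration strength (assumed value)
        self.counts = {d: 0 for d in self.domains}
        self.mean_reward = {d: 0.0 for d in self.domains}

    def select(self):
        # Visit every domain once before applying the UCB rule.
        for d in self.domains:
            if self.counts[d] == 0:
                return d
        total = sum(self.counts.values())

        def score(d):
            need = 1.0 - self.mean_reward[d]  # poor performance -> high need
            bonus = self.c * math.sqrt(math.log(total) / self.counts[d])
            return need + bonus

        return max(self.domains, key=score)

    def update(self, domain, reward):
        # Incremental running mean of the rewards observed for this domain.
        self.counts[domain] += 1
        n = self.counts[domain]
        self.mean_reward[domain] += (reward - self.mean_reward[domain]) / n

# Demo with fixed, illustrative per-domain rewards.
fixed = {"gsm8k": 0.5, "mmlu": 0.3, "strategyqa": 0.9}
sampler = UCBDomainSampler(fixed)
picks = {d: 0 for d in fixed}
for _ in range(300):
    d = sampler.select()
    picks[d] += 1
    sampler.update(d, fixed[d])
print(picks)
```

In this demo the weakest domain is sampled most often, while the exploration bonus keeps the stronger domains from being starved entirely.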

## Experimental Results: Simultaneous Improvement in Multi-Domain Reasoning Capabilities

Baselines were measured strictly following the training prompt template (GSM8K: 0.500, MMLU: 0.300). After optimization, the Qwen2.5-3B-Instruct model achieved:
- GSM8K (mathematical reasoning): improved by about 10 percentage points, to over 0.600;
- MMLU (commonsense QA): improved by about 10 percentage points, to over 0.400;
- StrategyQA (strategic reasoning): improved by about 5 percentage points, to over 0.950.

All domains improved simultaneously, with no cross-domain forgetting.

## Ablation Experiments: Verification of Component Necessity

Key findings from ablation experiments:
- Removing SAN leads to a significant drop in training stability for difficult domains;
- Pure result rewards perform poorly in multi-step reasoning domains;
- Inaccurate old policy probabilities cause KL drift and training instability;
- Fixed noise intensity causes temporal drift, which is effectively mitigated by an annealing strategy.
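The sensitivity to old-policy probabilities can be illustrated with a per-token sketch of a clipped GRPO-style objective. This is not the framework's loss code: the function names, the clipping values `eps` and `max_log_ratio`, and the use of the old policy as the KL anchor (in place of a frozen reference model) are assumptions made for illustration. The KL term uses the common non-negative estimator `exp(d) - d - 1` with `d = logp_old - logp_new`.

```python
import math

def clipped_token_loss(logp_new, logp_old, advantage, eps=0.2, max_log_ratio=5.0):
    """PPO/GRPO-style clipped surrogate loss for a single token."""
    # Clip the log ratio first so exp() cannot overflow on extreme values.
    log_ratio = max(-max_log_ratio, min(max_log_ratio, logp_new - logp_old))
    ratio = math.exp(log_ratio)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    return -surrogate  # minimize the negative surrogate

def kl_estimate(logp_new, logp_old):
    """Non-negative per-token KL estimate against the old policy."""
    d = logp_old - logp_new
    return math.exp(d) - d - 1.0

# With exact old log-probs the first update starts at ratio 1 and KL 0.
print(clipped_token_loss(-1.2, -1.2, advantage=1.0), kl_estimate(-1.2, -1.2))
# A mis-recorded old log-prob shifts both values before any learning happens.
print(clipped_token_loss(-1.2, -1.5, advantage=1.0), kl_estimate(-1.2, -1.5))
```

If the stored old log-probabilities do not match what the policy actually assigned at sampling time, the ratio and the KL estimate are biased from the first step, which is consistent with the drift reported above.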

## Limitations and Future Improvement Directions

### Limitations
- High computational resource requirements; local M4 only supports small batch sizes;
- Template design is highly domain-specific, so extending to a new domain requires designing its template by hand;
- Sparse rewards in some domains affect convergence.
### Future Directions
- Develop a general method for generating structural reward templates;
- Explore adaptive domain weight adjustment strategies;
- Expand to larger models (7B, 13B).