StrataRL: A Multi-Domain Reasoning Reinforcement Learning Framework for Small Language Models

This article introduces the StrataRL framework, which addresses the cross-domain catastrophic forgetting problem in GRPO training through hierarchical advantage normalization and structured template reward mechanisms, enabling small language models to achieve simultaneous improvements in mathematical, commonsense, and strategic reasoning tasks.

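Tags: GRPO, reinforcement learning, small language models, multi-domain reasoning, advantage normalization, structured rewards, model training, machine learning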

Published 2026-06-04 19:55 | Recent activity 2026-06-04 20:21 | Estimated read 6 min

Section 01

StrataRL Framework Overview: Addressing Cross-Domain Forgetting in Multi-Domain Reasoning for Small Models

StrataRL is a multi-domain reasoning reinforcement learning framework for small language models. It targets cross-domain catastrophic forgetting in GRPO training with two mechanisms, hierarchical advantage normalization (SAN) and structured template rewards (ST-GRPO). Together they improve mathematical, commonsense, and strategic reasoning at the same time, rather than trading one domain off against another as traditional mixed-domain training tends to do.

Section 02

Research Background: Cross-Domain Catastrophic Forgetting in GRPO Training

Group Relative Policy Optimization (GRPO) is a mainstream method for training the reasoning capabilities of large language models. However, standard GRPO suffers from cross-domain catastrophic forgetting during mixed multi-domain training: when the model improves in one domain (e.g., mathematical reasoning), its performance in another domain (e.g., commonsense QA) declines. The cause is global advantage normalization, which pools rewards from easy domains (high rewards) and difficult domains (low rewards) under a single mean and standard deviation. Even the best trajectories in a difficult domain can then fall below the pooled mean and be suppressed. StrataRL is designed to solve this problem.
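The suppression effect can be seen in a small numeric sketch. The reward values below are invented for illustration and do not come from the StrataRL experiments; the point is only that the best trajectory in a hard domain can receive a negative advantage when it is normalized together with an easy domain.

```python
import statistics

def z_normalize(rewards):
    """Plain z-score using the population standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for one mixed batch (illustrative values only).
easy_domain = [0.9, 1.0, 0.8, 1.0]  # easy domain: uniformly high rewards
hard_domain = [0.0, 0.4, 0.1, 0.0]  # hard domain: 0.4 is the best trajectory

# Global normalization: both domains share one mean and one std.
global_adv = z_normalize(easy_domain + hard_domain)
best_hard_global = global_adv[len(easy_domain) + 1]

# Per-domain normalization: the hard domain is compared only with itself.
local_adv = z_normalize(hard_domain)
best_hard_local = local_adv[1]

print(f"global normalization:     {best_hard_global:+.2f}")
print(f"per-domain normalization: {best_hard_local:+.2f}")
```

Under global normalization the best hard-domain trajectory sits below the pooled mean and is pushed down, while within its own domain it is clearly above average.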

Section 03

Core Innovations: Hierarchical Advantage Normalization and Structured Template Rewards

Hierarchical Advantage Normalization (SAN)

Rewards are normalized within each domain rather than across the whole mixed batch. The normalization strategy is selected dynamically from the batch reward variance: with zero variance the rewards are only centered, with low variance they are scaled with damping, and with normal variance they are z-normalized. This avoids cross-domain gradient bias.
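A minimal sketch of this three-way rule follows. The article does not give the variance thresholds or the form of the damping, so `ZERO_VAR_EPS`, `LOW_VAR_THRESHOLD`, `DAMPING`, and the function name `san_advantages` are all assumptions chosen for illustration, not the framework's actual values or API.

```python
from collections import defaultdict
import statistics

# Hypothetical constants: the article does not specify them.
ZERO_VAR_EPS = 1e-8       # at or below this, variance is treated as zero
LOW_VAR_THRESHOLD = 1e-2  # below this, variance counts as "low"
DAMPING = 0.1             # added to the std in the low-variance case

def san_advantages(rewards, domains):
    """Normalize rewards within each domain, choosing a strategy by variance."""
    by_domain = defaultdict(list)
    for idx, dom in enumerate(domains):
        by_domain[dom].append(idx)

    advantages = [0.0] * len(rewards)
    for idxs in by_domain.values():
        vals = [rewards[i] for i in idxs]
        mean = statistics.fmean(vals)
        var = statistics.pvariance(vals)
        if var <= ZERO_VAR_EPS:
            scale = 1.0                   # zero variance: center only
        elif var < LOW_VAR_THRESHOLD:
            scale = var ** 0.5 + DAMPING  # low variance: damped scaling
        else:
            scale = var ** 0.5            # normal variance: z-normalization
        for i in idxs:
            advantages[i] = (rewards[i] - mean) / scale
    return advantages

rewards = [1.0, 1.0, 1.0, 0.0, 1.0, 0.50, 0.52]
domains = ["a", "a", "a", "b", "b", "c", "c"]
adv = san_advantages(rewards, domains)
```

In this sketch the damping keeps a near-constant domain (`c`) from being blown up to unit variance, which is one plausible reading of "damped scaling".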

Structured Template Reward (ST-GRPO)

Specific reasoning templates are defined for each domain (e.g., math requires tags like <decompose>). The output structure is verified via regular expressions, eliminating the need for an external reward model and providing a reliable signal of reasoning quality.
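As a rough illustration of regex-based structure checking, the sketch below scores an output 1.0 when it consists of the required tag blocks in order and 0.0 otherwise. Only `<decompose>` is named in the article; the other tags, the commonsense template, and the names `DOMAIN_TEMPLATES` and `structure_reward` are hypothetical placeholders.

```python
import re

# Only <decompose> appears in the article; every other tag here is assumed.
DOMAIN_TEMPLATES = {
    "math": ["decompose", "solve", "answer"],
    "commonsense": ["recall", "answer"],
}

def build_pattern(tags):
    """Compile a regex that requires each tag block, in the given order."""
    parts = [rf"<{t}>(.+?)</{t}>" for t in tags]
    return re.compile(r"\s*" + r"\s*".join(parts) + r"\s*", re.DOTALL)

PATTERNS = {dom: build_pattern(tags) for dom, tags in DOMAIN_TEMPLATES.items()}

def structure_reward(output, domain):
    """Return 1.0 if the whole output matches the domain template, else 0.0."""
    return 1.0 if PATTERNS[domain].fullmatch(output) else 0.0

good = (
    "<decompose>Split 12*3 into 10*3 + 2*3.</decompose>\n"
    "<solve>30 + 6 = 36</solve>\n"
    "<answer>36</answer>"
)
print(structure_reward(good, "math"))
print(structure_reward("The answer is 36.", "math"))
```

A check like this only verifies form, not correctness, which is why the article pairs it with a separate result reward.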

Section 04

Training Architecture: Adaptive Sampling and Composite Reward Design

Key components of the training pipeline:

UCB Curriculum Sampler: Adaptive domain scheduling, prioritizing domains where the model performs poorly;
Rollout Engine: Supports Hugging Face (local M4) and vLLM (GPU environment) backends;
Composite Reward: Result reward (numeric/alphabetic/yes-no verification), structure reward (template check), repetition penalty;
GRPO Loss: Trained efficiently with QLoRA. No frozen reference model is kept, which saves memory, and log-ratio clipping together with precise KL alignment keeps training stable.
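The curriculum sampler in the list above could look roughly like the following UCB1-style sketch. It assumes that a domain's "payoff" is one minus its running mean reward, so weaker domains score higher; the class name, the payoff definition, and the simulated reward levels are illustrative assumptions rather than the framework's implementation.

```python
import math

class UCBDomainSampler:
    """UCB1 over domains; payoff is 1 - mean reward, so weak domains rank higher."""

    def __init__(self, domains, exploration=1.0):
        self.domains = list(domains)
        self.exploration = exploration
        self.counts = {d: 0 for d in self.domains}
        self.mean_reward = {d: 0.0 for d in self.domains}

    def select(self):
        # Visit every domain once before applying the UCB rule.
        for d in self.domains:
            if self.counts[d] == 0:
                return d
        total = sum(self.counts.values())

        def score(d):
            need = 1.0 - self.mean_reward[d]
            bonus = self.exploration * math.sqrt(2 * math.log(total) / self.counts[d])
            return need + bonus

        return max(self.domains, key=score)

    def update(self, domain, batch_mean_reward):
        # Incremental running mean of the rewards observed for this domain.
        self.counts[domain] += 1
        n = self.counts[domain]
        self.mean_reward[domain] += (batch_mean_reward - self.mean_reward[domain]) / n

# Simulated, fixed reward levels per domain (illustrative values only).
simulated = {"gsm8k": 0.5, "mmlu": 0.3, "strategyqa": 0.9}
sampler = UCBDomainSampler(simulated)
picks = {d: 0 for d in simulated}
for _ in range(200):
    chosen = sampler.select()
    picks[chosen] += 1
    sampler.update(chosen, simulated[chosen])
print(picks)
```

The exploration bonus keeps strong domains from being starved entirely, which matters when the goal is to avoid forgetting them.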

Section 05

Experimental Results: Simultaneous Improvement in Multi-Domain Reasoning Capabilities

Baselines were measured strictly following the training prompt template (GSM8K: 0.500, MMLU: 0.300). After optimization, the Qwen2.5-3B-Instruct model achieved:

GSM8K mathematical reasoning improved by about 10 percentage points, from 0.500 to over 0.600;
MMLU commonsense QA improved by about 10 percentage points, from 0.300 to over 0.400;
StrategyQA strategic reasoning improved by about 5 percentage points to over 0.950.

All domains improved simultaneously, with no cross-domain forgetting.

Section 06

Ablation Experiments: Verification of Component Necessity

Key findings from ablation experiments:

Removing SAN leads to a significant drop in training stability for difficult domains;
Pure result rewards perform poorly in multi-step reasoning domains;
Inaccurate old policy probabilities cause KL drift and training instability;
Fixed noise intensity causes temporal drift, which is effectively mitigated by an annealing strategy.

Section 07

Limitations and Future Improvement Directions

Limitations

High computational resource requirements; local M4 only supports small batch sizes;
Template design is highly domain-specific, requiring manual design for expansion;
Sparse rewards in some domains affect convergence.

Future Directions

Develop a general method for generating structural reward templates;
Explore adaptive domain weight adjustment strategies;
Expand to larger models (7B, 13B).
