# SimpleRL-Zoo: Using Minimalist Reinforcement Learning Recipes to Enhance Mathematical Reasoning Capabilities of Foundation Models

> The SimpleRL-Zoo project, open-sourced by the NLP Lab at Hong Kong University of Science and Technology, demonstrates a surprisingly efficient training method: using only 8K mathematical data samples and a rule-based reward function, it can achieve an absolute accuracy improvement of 10 to 20 percentage points in mathematical reasoning tasks for 10 different open-source foundation models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-16T13:43:01.000Z
- Last activity: 2026-04-16T13:58:54.112Z
- Popularity: 154.7
- Keywords: reinforcement learning, mathematical reasoning, GRPO, open-source models, Qwen, Llama, Mistral, DeepSeek, Verl, vLLM
- Page link: https://www.zingnex.cn/en/forum/thread/simplerl-zoo-recipe
- Canonical: https://www.zingnex.cn/forum/thread/simplerl-zoo-recipe

---

## SimpleRL-Zoo Project Overview: Minimalist RL Method Significantly Enhances Mathematical Reasoning of Foundation Models

SimpleRL-Zoo, open-sourced by the NLP Lab at Hong Kong University of Science and Technology, trains 10 open-source foundation models, ranging from 0.5B to 32B parameters and including Llama3, Mistral, DeepSeekMath, and the Qwen2.5 series, with only 8K mathematical training examples and a rule-based reward function. Across these models, accuracy on mathematical reasoning tasks improves by 10 to 20 percentage points in absolute terms.

## Project Background and Key Findings

SimpleRL-Zoo shows that a simple reinforcement learning recipe improves the reasoning of large language models across a range of model families and sizes. The research team trained 10 foundation models of different architectures (0.5B to 32B parameters): Llama3 8B, Mistral 7B/24B, DeepSeekMath 7B, the Qwen2.5 series (0.5B, 1.5B, 7B, 14B, 32B), and Qwen2.5-Math-7B. These models gained between 10 and more than 20 percentage points of accuracy on standard mathematical reasoning benchmarks such as GSM8K, MATH500, Minerva Math, OlympiadBench, AIME24, and AMC23.

## Detailed Technical Approach

### Training Data Design
The training data is organized into three difficulty tiers that mirror how human learners progress: easy (GSM8K and MATH Level 1), medium (MATH Levels 1-4), and hard (MATH Levels 3-5).
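
The tier definitions above can be sketched as a small bucketing function. This is a minimal illustration, not the project's actual data pipeline: the record fields `source` and `level` and the function name `difficulty_tiers` are hypothetical. Because the level ranges as stated overlap, one example can belong to more than one tier.

```python
def difficulty_tiers(example):
    """Return every tier ("easy", "medium", "hard") an example falls into.

    `example` is a hypothetical record such as
    {"source": "math", "level": 3} or {"source": "gsm8k"}.
    """
    source = example["source"]
    level = example.get("level")  # GSM8K records carry no level here
    is_math = source == "math" and level is not None

    tiers = []
    # Easy: all of GSM8K plus MATH Level 1
    if source == "gsm8k" or (is_math and level == 1):
        tiers.append("easy")
    # Medium: MATH Levels 1-4
    if is_math and 1 <= level <= 4:
        tiers.append("medium")
    # Hard: MATH Levels 3-5
    if is_math and 3 <= level <= 5:
        tiers.append("hard")
    return tiers
```

Under this reading, a MATH Level 3 problem is eligible for both the medium and the hard set, while a Level 5 problem appears only in the hard set.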

### Reinforcement Learning Algorithm
The project implements GRPO (Group Relative Policy Optimization) on top of the Verl framework. GRPO requires no value function estimate: it samples multiple outputs for the same problem and optimizes the policy by comparing their rewards within the group, which reduces computational overhead. The Ray distributed framework and the vLLM inference engine provide efficient parallel training.
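
To make the "compare multiple outputs for the same problem" idea concrete, here is a minimal sketch of the group-relative advantage used in GRPO: each sampled response is scored against the mean and standard deviation of its own group, so no learned value function is needed. The function name, the `eps` constant, and the choice of population standard deviation are assumptions for illustration; this is not the Verl implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    rewards: non-empty list of scalar rewards, one per sampled response.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one problem: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. When all responses in a group get the same reward, every advantage is zero and that prompt contributes no gradient signal.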

### Reward Function Design
The reward is purely rule-based, which makes it interpretable, stable, and cheap: no separate reward model has to be trained.
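
A rule-based reward of this kind can be sketched as follows. The sketch assumes the model writes its final answer inside a LaTeX `\boxed{...}` and that correctness is decided by exact string match after whitespace removal; both are simplifying assumptions, and the 1.0/0.0 reward values and all function names are hypothetical. A real verifier would typically also check mathematical equivalence.

```python
def extract_boxed(text):
    """Return the contents of the last LaTeX boxed expression, or None."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of the marker
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer):
    """Remove all whitespace so that '1 / 2' and '1/2' compare equal."""
    return "".join(answer.split())


def rule_based_reward(response, reference):
    """Return 1.0 for a matching boxed answer, otherwise 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(reference) else 0.0
```

Because the rule is deterministic, the same response always earns the same reward, which is where the interpretability and stability come from.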

## Key Experimental Results and Analysis

### Model Performance Improvement Comparison
Average accuracy of selected models before and after training:

| Model | Before Training (%) | After Training (%) | Improvement (percentage points) |
|------|--------|--------|----------|
| Qwen-2.5-Math-7B | 37.2 | 59.5 | +22.3 |
| Qwen-2.5-32B | 45.9 | 61.9 | +16.0 |
| Mistral-Small-24B | 27.6 | 49.6 | +22.0 |
| DeepSeek-Math-7B | 11.3 | 29.2 | +17.9 |
| Llama-3.1-8B | 10.6 | 22.0 | +11.4 |

On AIME24 (Pass@1), Qwen-2.5-Math-7B improved from 13.3% to 40.0%.
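
To put the AIME24 figures in perspective, the short calculation below converts the percentages into solved-problem counts. It assumes the benchmark contains 30 problems (AIME I and II, 15 each) and that each problem is attempted once; the constant and function name are illustrative.

```python
AIME24_NUM_PROBLEMS = 30  # assumption: AIME I + AIME II 2024, 15 problems each


def solved_count(accuracy_percent, num_problems=AIME24_NUM_PROBLEMS):
    """Convert an accuracy percentage into the nearest whole number of problems."""
    return round(accuracy_percent / 100 * num_problems)


before = solved_count(13.3)  # problems solved before RL training
after = solved_count(40.0)   # problems solved after RL training
points_per_problem = 100 / AIME24_NUM_PROBLEMS  # accuracy contributed by one problem
```

Under these assumptions the jump corresponds to going from 4 to 12 solved problems, and a single problem moves the score by about 3.3 points, so small differences on this benchmark should be read with care.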

### Reasoning Behavior Analysis
RL training increased the models' response length, indicating more detailed step-by-step reasoning. However, longer responses do not necessarily go hand in hand with cognitive behaviors such as self-verification, and different models exhibit different reasoning patterns.

## Hardware Requirements and Training Efficiency

- **Minimum Configuration**: A single H100/A100-80G GPU can train the Qwen-2.5-0.5B model
- **7B/14B Models**: 2×8 H100-80G GPUs (16 in total), taking about 15 hours to complete 100 training steps
- **32B Model**: 8×8 H100-80G GPUs (64 in total), taking about 1.5 days to complete training

The single-GPU entry point for the smallest model lowers the barrier to reproducing and extending the work, although the larger models still require multi-GPU clusters.
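
The listed configurations can be compared as rough GPU-hour budgets. The sketch below assumes that "2x8" and "8x8" mean nodes × GPUs per node and that the quoted durations are wall-clock time with 1.5 days equal to 36 hours; the function name is illustrative.

```python
def gpu_hours(nodes, gpus_per_node, wall_clock_hours):
    """Total GPU-hours for a run that occupies every GPU for the whole duration."""
    return nodes * gpus_per_node * wall_clock_hours


# 7B/14B models: 2 x 8 GPUs for about 15 hours
budget_7b_14b = gpu_hours(2, 8, 15)       # 240 GPU-hours

# 32B model: 8 x 8 GPUs for about 1.5 days
budget_32b = gpu_hours(8, 8, 1.5 * 24)    # 2304.0 GPU-hours
```

Under these assumptions the 32B run costs roughly ten times as many GPU-hours as a 7B or 14B run.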

## Open-Source Contributions and Community Value

SimpleRL-Zoo is fully open-sourced, including:
- Complete training code and configuration files
- Weights for the 10 RL-trained models (released on Hugging Face)
- Intermediate training checkpoints
- Gradio visualization tool (for analyzing reasoning processes)
- Evaluation scripts and analysis tools

The project is licensed under Apache 2.0. The code builds on the Verl framework and vLLM acceleration, and the evaluation code references that of Qwen2.5-Math.

## Practical Significance and Future Outlook

The project demonstrates that RL training can enhance model reasoning capabilities under limited resources. This is valuable for resource-constrained institutions, for specific fields such as mathematical education and scientific computing, and for model optimization in general. It also supports the RL approach of work like DeepSeek-R1 and provides a reproducible path.

Summary: The project elicits deeper reasoning from models with a small amount of high-quality data and a simple reward mechanism, embodying the "less is more" philosophy. Its open-source release and documentation lay a foundation for subsequent research.