Zing Forum

SimpleRL-Zoo: Using Minimalist Reinforcement Learning Recipes to Enhance Mathematical Reasoning Capabilities of Foundation Models

The SimpleRL-Zoo project, open-sourced by the NLP Lab at Hong Kong University of Science and Technology, demonstrates a surprisingly efficient training method: using only 8K mathematical data samples and a rule-based reward function, it can achieve an absolute accuracy improvement of 10 to 20 percentage points in mathematical reasoning tasks for 10 different open-source foundation models.

Tags: Reinforcement Learning · Mathematical Reasoning · GRPO · Open-Source Models · Qwen · Llama · Mistral · DeepSeek · Verl · vLLM
Published: 2026-04-16 21:43 · Last activity: 2026-04-16 21:58 · Estimated read: 7 min

Section 01

SimpleRL-Zoo Project Overview: Minimalist RL Method Significantly Enhances Mathematical Reasoning of Foundation Models

The SimpleRL-Zoo project, open-sourced by the NLP Lab at Hong Kong University of Science and Technology, demonstrates an efficient training method: using only 8K mathematical data samples and a rule-based reward function, it can achieve an absolute accuracy improvement of 10 to 20 percentage points in mathematical reasoning tasks for 10 different open-source foundation models (covering 0.5B to 32B parameters, including Llama3, Mistral, DeepSeekMath, Qwen2.5 series, etc.).


Section 02

Project Background and Key Findings

The SimpleRL-Zoo project marks a breakthrough in reinforcement learning for reasoning training of large language models. The research team trained 10 foundation models of different architectures (parameter range: 0.5B-32B), including Llama3 8B, Mistral 7B/24B, DeepSeekMath 7B, the Qwen2.5 series (0.5B, 1.5B, 7B, 14B, 32B), and Qwen2.5-Math-7B. These models achieved accuracy improvements of 10 to over 20 percentage points on standard mathematical reasoning benchmarks such as GSM8K, MATH500, Minerva Math, OlympiadBench, AIME24, and AMC23.


Section 03

Detailed Technical Approach

Training Data Design

Uses a hierarchical difficulty progression strategy: simple level (GSM8K, MATH Level 1), medium level (MATH Levels 1-4), and hard level (MATH Levels 3-5), simulating human learning paths.
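The tiered split described above can be sketched as a simple bucketing step. This is a hypothetical illustration; the record schema (`source`, `level` fields) is an assumption, not the project's actual data format, and note that the medium and hard tiers deliberately overlap on MATH Levels 3-4.

```python
def split_by_difficulty(problems):
    """Bucket problems into easy/medium/hard tiers by source and MATH level.

    Hypothetical schema: each problem is a dict with a "source" key
    ("GSM8K" or "MATH") and, for MATH, an integer "level" (1-5).
    """
    tiers = {"easy": [], "medium": [], "hard": []}
    for p in problems:
        src, level = p["source"], p.get("level", 0)
        # Easy tier: GSM8K plus MATH Level 1
        if src == "GSM8K" or (src == "MATH" and level == 1):
            tiers["easy"].append(p)
        # Medium tier: MATH Levels 1-4
        if src == "MATH" and 1 <= level <= 4:
            tiers["medium"].append(p)
        # Hard tier: MATH Levels 3-5 (overlaps with medium on Levels 3-4)
        if src == "MATH" and 3 <= level <= 5:
            tiers["hard"].append(p)
    return tiers

problems = [
    {"source": "GSM8K", "question": "2+2?"},
    {"source": "MATH", "level": 3, "question": "Solve x^2=4."},
    {"source": "MATH", "level": 5, "question": "An olympiad problem."},
]
tiers = split_by_difficulty(problems)
print(len(tiers["easy"]), len(tiers["medium"]), len(tiers["hard"]))  # 1 1 2
```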

Reinforcement Learning Algorithm

Implements the GRPO (Group Relative Policy Optimization) algorithm based on the Verl framework, which does not require value function estimation. It optimizes the policy by comparing multiple outputs for the same problem, reducing computational overhead. Combined with the Ray distributed framework and vLLM inference acceleration engine, it achieves efficient parallel training.
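The core idea behind GRPO's value-function-free design can be sketched as follows: sample a group of responses per problem, score each one, and normalize rewards within the group to obtain per-response advantages. This is a minimal illustration of the group-relative advantage, not the Verl implementation.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / group std.

    Replaces a learned value baseline with simple within-group
    normalization, which is what lets GRPO skip value-function training.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled answers to one problem; reward 1 = correct, 0 = wrong.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # correct answers get positive advantage, wrong ones negative
```

Because the advantages are centered within each group, correct responses are pushed up exactly as much as incorrect ones are pushed down, with no extra critic network to train.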

Reward Function Design

Uses a purely rule-based reward mechanism, with advantages including strong interpretability, high stability, and low cost (no need for additional reward model training).
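A rule-based reward of this kind can be sketched as answer extraction plus exact comparison. This is a hedged illustration assuming the model emits its final answer in a `\boxed{...}` span; the exact rules in SimpleRL-Zoo (format penalties, answer normalization) may differ.

```python
import re

def rule_reward(response: str, gold: str) -> float:
    """Return 1.0 if the last \\boxed{...} answer matches gold, else 0.0."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0  # no parseable final answer -> no reward
    pred = matches[-1].strip()
    return 1.0 if pred == gold.strip() else 0.0

print(rule_reward(r"The answer is \boxed{42}.", "42"))  # 1.0
print(rule_reward("I think it's 42.", "42"))            # 0.0
```

Because the reward is a pure string rule, it needs no reward-model training and cannot drift, which is the stability and cost advantage noted above.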


Section 04

Key Experimental Results and Analysis

Model Performance Improvement Comparison

Average accuracy of selected models before and after training:

Model             | Before Training | After Training | Improvement
------------------|-----------------|----------------|------------
Qwen-2.5-Math-7B  | 37.2%           | 59.5%          | +22.3 pp
Qwen-2.5-32B      | 45.9%           | 61.9%          | +16.0 pp
Mistral-Small-24B | 27.6%           | 49.6%          | +22.0 pp
DeepSeek-Math-7B  | 11.3%           | 29.2%          | +17.9 pp
Llama-3.1-8B      | 10.6%           | 22.0%          | +11.4 pp

Qwen-2.5-Math-7B improved from 13.3% to 40.0% in AIME24 (Pass@1).

Reasoning Behavior Analysis

RL training increased the models' response length, indicating more detailed step-by-step reasoning. However, longer responses do not necessarily reflect cognitive behaviors such as self-verification, and different models exhibit distinct reasoning patterns.
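The two quantities discussed above can be measured with a simple script: mean response length, and the fraction of responses containing self-verification cues. This is an illustrative sketch, not the project's actual analysis code; the cue list is a hypothetical choice.

```python
def response_stats(responses, cues=("verify", "check", "wait", "re-examine")):
    """Mean word count and fraction of responses containing any cue word."""
    lengths = [len(r.split()) for r in responses]
    mean_len = sum(lengths) / len(lengths)
    cue_rate = sum(any(c in r.lower() for c in cues) for r in responses) / len(responses)
    return mean_len, cue_rate

responses = [
    "The sum is 4. Let me verify: 2 + 2 = 4, so the answer is 4.",
    "2 + 2 = 4.",
]
mean_len, cue_rate = response_stats(responses)
print(mean_len, cue_rate)  # length and cue frequency are tracked separately
```

Tracking the two metrics separately is exactly what reveals the decoupling noted above: length can grow over training without the cue rate moving.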


Section 05

Hardware Requirements and Training Efficiency

  • Minimum Configuration: A single H100/A100-80G GPU can train the Qwen-2.5-0.5B model
  • 7B/14B Models: 2x8 H100-80G GPUs, taking about 15 hours to complete 100 training steps
  • 32B Model: 8x8 H100-80G GPUs, taking about 1.5 days to complete training

The relatively modest hardware requirements facilitate reproduction and expansion.


Section 06

Open-Source Contributions and Community Value

SimpleRL-Zoo is fully open-sourced, including:

  • Complete training code and configuration files
  • 10 RL-trained model weights (released on Hugging Face)
  • Intermediate training checkpoints
  • Gradio visualization tool (for analyzing reasoning processes)
  • Evaluation scripts and analysis tools

Licensed under Apache 2.0, the code depends on the Verl framework and vLLM acceleration, and references Qwen2.5-Math evaluation code.


Section 07

Practical Significance and Future Outlook

The project demonstrates the potential of RL training to enhance model reasoning under limited resources, which is valuable for resource-constrained institutions, domain-specific applications (e.g., mathematics education, scientific computing), and model optimization. It validates the RL ideas behind works like DeepSeek-R1 and provides a reproducible path to them.

Summary: By using a small amount of high-quality data and a simple reward mechanism, the project elicits the deep reasoning capabilities of foundation models, embodying the "less is more" philosophy. Its open-source release and thorough documentation lay a solid foundation for follow-up research.