Open-source Implementation of Training 7B Language Models for Mathematical Reasoning Using GRPO

This project fully reproduces the reasoning training process from the DeepSeek-R1 paper. Through two-stage training (SFT cold start + GRPO reinforcement learning), it enables the Qwen2.5-7B model to learn step-by-step reasoning to solve mathematical problems, achieving verifiable reward signal optimization without manual preference labels.

Tags: GRPO · DeepSeek-R1 · Qwen2.5 · Mathematical Reasoning · Reinforcement Learning · Large Language Models · Open-Source Reproduction · Cold Start · Reward Saturation
Published 2026-05-13 15:58 · Recent activity 2026-05-13 16:31 · Estimated read: 8 min

Section 01

Introduction / Main Post: Open-source Implementation of Training 7B Language Models for Mathematical Reasoning Using GRPO

This project fully reproduces the reasoning training process from the DeepSeek-R1 paper. Through two-stage training (SFT cold start + GRPO reinforcement learning), it enables the Qwen2.5-7B model to learn step-by-step reasoning to solve mathematical problems, achieving verifiable reward signal optimization without manual preference labels.


Section 02

Project Overview

This project is an open-source implementation that fully reproduces the reasoning training process from the DeepSeek-R1 paper. Its goal is to teach a 7B-parameter language model to solve mathematical problems through step-by-step reasoning using Group Relative Policy Optimization (GRPO). Unlike PPO-based RLHF, which typically relies on a reward model trained from manually labeled preference data, GRPO drops the separate critic (value) network and instead baselines each rollout against the other rollouts in its group, significantly reducing computational overhead.
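To make the intra-group baseline concrete, here is a minimal sketch (not the project's actual training code) of how group-relative advantages can be computed: each rollout's reward is normalized against the mean and standard deviation of its own group, so no learned value network is needed.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO-style advantages from per-rollout rewards.

    rewards: shape (num_prompts, group_size) -- one row per prompt, one column per
    sampled completion in that prompt's group. Each reward is baselined against its
    own group's mean and scaled by the group's std, replacing a learned critic.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, group size 4: two correct+formatted rollouts (1.5), one format-only (0.5), one zero.
rewards = torch.tensor([[1.5, 0.5, 1.5, 0.0]])
print(group_relative_advantages(rewards))
```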

This implementation is based on the Qwen2.5-7B-Instruct model and completes training on a single NVIDIA H100 NVL (99.9GB VRAM), providing a reproducible reasoning enhancement solution for small and medium-sized teams.



Section 03

Phase 1: SFT Cold Start (Supervised Fine-tuning)

Objective: Before starting reinforcement learning, let the model learn the output format for 'thinking'.

Training data includes approximately 27,000 examples:

  • GSM8K training set (7,473 entries): grade-school math word problems
  • NuminaMath-CoT sampling (20,000 entries): Competition-level math problems and their chain-of-thought solutions

Key Training Configuration:

  • Full-parameter fine-tuning (without LoRA) to ensure the model has sufficient capacity to learn new behaviors
  • 2 epochs, effective batch size of 32
  • Learning rate of 2e-5 with cosine decay
  • Key Technique: Loss masking for prompt tokens, so gradients only flow through the reasoning completion part
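A minimal sketch of the prompt-masking idea, assuming a Hugging Face-style causal LM where label positions set to -100 are ignored by the cross-entropy loss (the project's actual data collator may differ):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by torch's CrossEntropyLoss

def build_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the prompt so gradients only flow
    through the completion (the <think> reasoning and the final answer)."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Example: 3 prompt tokens followed by 4 completion tokens (ids are arbitrary).
print(build_labels([11, 22, 33, 44, 55, 66, 77], prompt_len=3))
# -> [-100, -100, -100, 44, 55, 66, 77]
```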

Training Results: Training loss of 0.3357, token accuracy of 92.5%, taking approximately 2 hours.


Section 04

Phase 2: GRPO Reinforcement Learning

Core Innovation: GRPO does not require an independent critic network; instead, it uses intra-group relative rewards as the baseline.

Reward Function Design (verifiable triplet):

| Reward Dimension | Weight | Judgment Logic |
|---|---|---|
| Correctness | 1.0 | The parsed final answer matches the reference answer |
| Format | 0.5 | The response contains a valid <think>...</think> tag structure |
| Length penalty | -0.1 (soft) | Applied when the response exceeds the 500-800 token range |
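A hedged sketch of how this verifiable reward triplet could be implemented. The weights come from the table above; the parsing rules, the interpretation of the length rule (penalize completions outside the 500-800 token window), and the function name are assumptions rather than the project's actual code.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Verifiable reward = correctness (1.0) + format (0.5) + soft length penalty (-0.1)."""
    total = 0.0

    # Format reward: a well-formed <think>...</think> block must be present.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        total += 0.5

    # Correctness reward: compare the last number in the response to the gold answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    try:
        if numbers and abs(float(numbers[-1]) - float(gold_answer)) < 1e-6:
            total += 1.0
    except ValueError:
        pass

    # Soft length penalty (assumed rule): completions outside the 500-800 token
    # window lose 0.1; whitespace tokens approximate the tokenizer count here.
    if not (500 <= len(response.split()) <= 800):
        total -= 0.1

    return total

print(rule_based_reward("<think>4 + 3 = 7</think>\nThe answer is 7", "7"))  # 1.4 (correct + format - length)
```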

Key Hyperparameters:

  • Group size G=4: Generate 4 candidate answers per problem
  • KL coefficient of 0.04: keeps the policy from drifting too far from the SFT reference policy
  • 1,000 GRPO steps, learning rate of 5e-7
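If the run is driven through TRL's GRPOTrainer (a plausible setup for this kind of reproduction, though the post does not say which trainer the project uses), the hyperparameters above map roughly onto the following sketch. Parameter names may differ across TRL versions, and the checkpoint path and column names are placeholders.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K prompts for online rollouts; the gold number after "####" is kept for the reward.
gsm8k = load_dataset("openai/gsm8k", "main", split="train").map(
    lambda ex: {"prompt": ex["question"], "gold_answer": ex["answer"].split("####")[-1].strip()}
)

def grpo_reward(completions, gold_answer, **kwargs):
    # Batch adapter: TRL calls reward functions with lists of completions plus
    # dataset columns; this reuses the rule_based_reward sketched in the previous section.
    return [rule_based_reward(c, a) for c, a in zip(completions, gold_answer)]

config = GRPOConfig(
    output_dir="qwen2.5-7b-grpo",
    num_generations=4,    # group size G=4 candidate answers per problem
    beta=0.04,            # KL coefficient toward the SFT reference policy
    learning_rate=5e-7,
    max_steps=1000,       # 1,000 GRPO steps
)

trainer = GRPOTrainer(
    model="path/to/sft-cold-start-checkpoint",  # placeholder, not the project's path
    args=config,
    reward_funcs=grpo_reward,
    train_dataset=gsm8k,
)
trainer.train()
```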


Section 05

Benchmark Results and Analysis

All three model stages were evaluated with lm-evaluation-harness under identical settings:

| Benchmark | Instruct Baseline | SFT Checkpoint | GRPO Final |
|---|---|---|---|
| GSM8K (8-shot) | 82.64% | 75.51% | 75.66% |
| MATH 500 (4-shot) | 20.60% | 24.20% | 24.20% |
| ARC-Challenge (25-shot) | 67.06% | 62.97% | 62.80% |
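For reference, a hedged sketch of driving one of these evaluations through lm-evaluation-harness's Python API (the post does not say whether the CLI or the API was used; the checkpoint path is a placeholder, and the task name and few-shot count follow the table above for GSM8K):

```python
import lm_eval

# 8-shot GSM8K on one checkpoint; repeat per checkpoint and benchmark with the
# few-shot counts listed in the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/checkpoint,dtype=bfloat16",  # placeholder path
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```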

Section 06

Key Findings

1. GSM8K Score Drop is an Evaluation Artifact

SFT changed the model's output format: the model now generates a <think> reasoning chain before giving its answer, while the GSM8K parser in lm-evaluation-harness is calibrated to the original Instruct model's direct-answer style. The drop therefore reflects an answer-extraction mismatch, not a regression in reasoning ability.
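A purely illustrative example of such a mismatch. The pattern below is a hypothetical strict answer filter, not the harness's actual regex; it shows how an extractor calibrated to the Instruct model's direct-answer style can score a correct <think>-formatted response as wrong.

```python
import re

# Hypothetical strict filter expecting the Instruct model's "The answer is N" style.
STRICT_ANSWER = re.compile(r"The answer is (-?\d+)")

instruct_style = "3 apples plus 4 apples. The answer is 7"
think_style = "<think>3 apples plus 4 apples makes 7 apples.</think>\nFinal answer: 7"

print(STRICT_ANSWER.search(instruct_style))  # matches -> scored correct
print(STRICT_ANSWER.search(think_style))     # None    -> scored wrong despite the correct answer
```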

2. MATH Benchmark +3.6% is a Real Ability Improvement

The model was never trained on MATH problems (training data only includes GSM8K and NuminaMath), but the increase from 20.60% to 24.20% indicates that SFT successfully installed a generalizable reasoning format rather than simple pattern matching.

3. Reason for Limited GRPO Improvement: Reward Saturation

The project authors discovered an important technical phenomenon: since the SFT cold start was very successful (most GSM8K rollouts were correct), the 4 rollouts in a group often received the same reward, leading to an advantage signal close to zero.

Measurements show frac_reward_zero_std averaging 0.63, i.e., roughly 63% of the time the rewards within a group had zero standard deviation and therefore yielded a near-zero gradient signal. This is the problem that the curriculum filtering mentioned in the DeepSeek-R1 paper aims to solve: select medium-difficulty problems where only 1-2 of the rollouts are correct, rather than easy problems where 80% of rollouts are correct.
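A minimal sketch (illustrative names, not the project's logging code) of how this saturation statistic can be computed, together with the curriculum-style filter described above that keeps only prompts on which the group disagrees:

```python
import torch

def frac_reward_zero_std(rewards: torch.Tensor, eps: float = 1e-6) -> float:
    """Fraction of prompt groups whose rewards have (near-)zero standard deviation.

    rewards: shape (num_prompts, group_size). When every rollout in a group gets the
    same reward, the group-relative advantage collapses to ~0 and that prompt
    contributes almost no gradient signal.
    """
    return (rewards.std(dim=1) < eps).float().mean().item()

def keep_medium_difficulty(rewards: torch.Tensor, correct_threshold: float = 1.0) -> torch.Tensor:
    """Curriculum-style mask: keep prompts where some, but not all, rollouts are correct."""
    num_correct = (rewards >= correct_threshold).sum(dim=1)
    return (num_correct > 0) & (num_correct < rewards.shape[1])

rewards = torch.tensor([
    [1.5, 1.5, 1.5, 1.5],   # saturated group: all 4 rollouts correct, zero advantage
    [1.5, 0.5, 0.0, 1.5],   # informative group: mixed outcomes
])
print(frac_reward_zero_std(rewards))    # 0.5
print(keep_medium_difficulty(rewards))  # tensor([False,  True])
```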



Section 07

Why Use Full-Parameter Fine-tuning Instead of LoRA for SFT?

LoRA updates only a small number of parameters in low-rank adapters, which suits incremental learning. The goal of the cold start, however, is to install a brand-new behavioral prior (the structured CoT format), and full-parameter fine-tuning gives the model more capacity for that distribution shift. The H100's ~99 GB of VRAM is sufficient for full-parameter training of a 7B model.


Section 08

Why Only Use GSM8K for GRPO?

GRPO requires verifiable reward signals—answers must be programmatically checkable. GSM8K's answers are clean numerical values, while NuminaMath competition problems have more complex answer formats, which would increase the error rate of the reward function.
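As an illustration of why GSM8K rewards are easy to verify: every GSM8K gold answer ends with a line of the form "#### <number>", so the checker only has to compare two numbers (a minimal sketch; helper names are illustrative).

```python
import re

def gsm8k_gold_number(answer_field: str) -> float:
    """GSM8K gold answers end with a line like '#### 72'."""
    return float(answer_field.split("####")[-1].strip().replace(",", ""))

def is_correct(model_answer: str, answer_field: str, tol: float = 1e-6) -> bool:
    """Programmatic check: compare the last number in the model's answer to the gold number."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_answer.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - gsm8k_gold_number(answer_field)) < tol

gold = "Natalia sold 48/2 = 24 clips in May, so 48 + 24 = 72 in total.\n#### 72"
print(is_correct("<think>48 + 24 = 72</think>\nThe final answer is 72.", gold))  # True
```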