# Open-source Implementation of Training 7B Language Models for Mathematical Reasoning Using GRPO

> This project fully reproduces the reasoning training process from the DeepSeek-R1 paper. Through two-stage training (SFT cold start + GRPO reinforcement learning), it enables the Qwen2.5-7B model to learn step-by-step reasoning to solve mathematical problems, achieving verifiable reward signal optimization without manual preference labels.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-13T07:58:04.000Z
- Last activity: 2026-05-13T08:31:25.721Z
- Heat: 161.4
- Keywords: GRPO, DeepSeek-R1, Qwen2.5, mathematical reasoning, reinforcement learning, large language models, open-source reproduction, cold start, reward saturation
- Page link: https://www.zingnex.cn/en/forum/thread/grpo7b
- Canonical: https://www.zingnex.cn/forum/thread/grpo7b
- Markdown source: floors_fallback

---

## Project Overview

This project is an open-source implementation that fully reproduces the reasoning training process from the DeepSeek-R1 paper. Its goal is to teach a 7B-parameter language model to solve mathematical problems through step-by-step reasoning using **Group Relative Policy Optimization (GRPO)**. Unlike PPO, which requires a separately trained critic network, GRPO estimates its baseline from intra-group relative rewards, significantly reducing computational overhead and, combined with verifiable rewards, removing the need for manual preference labels.
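
As a minimal illustration of the group-relative baseline, here is a sketch of the standard GRPO advantage computation (illustrative code, not taken from this repository):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Compute GRPO advantages for one prompt's group of G rollouts.

    rewards: shape (G,), one scalar reward per sampled completion.
    The group mean replaces PPO's learned critic as the baseline,
    and the group std normalizes the scale.
    """
    baseline = rewards.mean()
    scale = rewards.std() + eps  # eps guards against zero-variance groups
    return (rewards - baseline) / scale

# Example: a group of G=4 rollouts in which one answer is correct.
rewards = torch.tensor([1.5, 0.5, 0.5, 0.5])
print(group_relative_advantages(rewards))  # the correct rollout gets a positive advantage
```

When every rollout in a group earns the same reward, the numerator is zero for all of them and no gradient flows; this is exactly the reward-saturation failure mode analyzed later in this post.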

This implementation is based on the Qwen2.5-7B-Instruct model and completes training on a single NVIDIA H100 NVL (99.9GB VRAM), providing a reproducible reasoning enhancement solution for small and medium-sized teams.

---

## Phase 1: SFT Cold Start (Supervised Fine-tuning)

**Objective**: Before reinforcement learning begins, teach the model the output format for 'thinking'.

Training data comprises approximately 27,000 examples (a loading sketch follows the list):
- GSM8K training set (7,473 entries): K-12 math word problems
- NuminaMath-CoT sampling (20,000 entries): Competition-level math problems and their chain-of-thought solutions
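
A minimal sketch of how this mix could be assembled with the Hugging Face `datasets` library (the hub IDs, column names, and sampling seed are assumptions; the post does not show the actual loading code):

```python
from datasets import load_dataset, concatenate_datasets

# Assumed hub paths; the post only names the datasets.
gsm8k = load_dataset("openai/gsm8k", "main", split="train")    # 7,473 word problems
numina = load_dataset("AI-MO/NuminaMath-CoT", split="train")   # competition problems + CoT

# Sample 20,000 NuminaMath examples to match the mix described above.
numina_20k = numina.shuffle(seed=42).select(range(20_000))

def to_sft_example(row, question_key, answer_key):
    # Normalize both sources into a shared prompt/completion schema.
    return {"prompt": row[question_key], "completion": row[answer_key]}

sft_data = concatenate_datasets([
    gsm8k.map(lambda r: to_sft_example(r, "question", "answer"),
              remove_columns=gsm8k.column_names),
    numina_20k.map(lambda r: to_sft_example(r, "problem", "solution"),
                   remove_columns=numina_20k.column_names),
]).shuffle(seed=42)  # ~27,000 mixed examples
```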

**Key Training Configuration**:
- Full-parameter fine-tuning (without LoRA) to ensure the model has sufficient capacity to learn new behaviors
- 2 epochs, effective batch size of 32
- Learning rate of 2e-5 with cosine decay
- **Key Technique**: Loss masking on prompt tokens, so that gradients flow only through the reasoning completion (see the sketch after this list)
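
A minimal sketch of prompt-token loss masking, assuming a standard Hugging Face causal-LM setup (the repository's actual collator may differ):

```python
import torch

IGNORE_INDEX = -100  # Hugging Face's cross-entropy loss skips labels set to -100

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Build labels so the loss is computed only on completion tokens.

    input_ids: (seq_len,) token IDs for prompt + completion.
    prompt_len: number of prompt tokens at the start of the sequence.
    """
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # no gradient from prompt tokens
    return labels

# Usage: model(input_ids=ids.unsqueeze(0), labels=mask_prompt_labels(ids, plen).unsqueeze(0))
# computes cross-entropy only over the reasoning completion.
```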

**Training Results**: final training loss of 0.3357, token accuracy of 92.5%; training took approximately 2 hours.

## Phase 2: GRPO Reinforcement Learning

**Core Innovation**: GRPO does not require an independent critic network; instead, it uses **intra-group relative rewards** as the baseline.

**Reward Function Design** (verifiable triplet):

| Reward Dimension | Weight | Judgment Logic |
|---------|------|---------|
| Correctness | 1.0 | The parsed final answer matches the reference answer |
| Format | 0.5 | Contains a valid `<think>...</think>` tag structure |
| Length Penalty | -0.1 (soft) | Applied when the response length falls outside the 500-800 token range |
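
A minimal sketch of a reward function with this triplet structure (the parsing regexes, numeric tolerance, and exact penalty shape are assumptions; only the weights and the 500-800 token band come from the table above):

```python
import re

def reward(response: str, gold_answer: str, num_tokens: int) -> float:
    """Verifiable reward: correctness + format bonus - soft length penalty."""
    total = 0.0

    # Correctness (weight 1.0): compare the last number in the response
    # against the gold answer. A real checker needs more robust parsing.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if numbers:
        try:
            if abs(float(numbers[-1]) - float(gold_answer)) < 1e-6:
                total += 1.0
        except ValueError:
            pass

    # Format (weight 0.5): require a well-formed <think>...</think> block.
    if re.search(r"<think>.+?</think>", response, flags=re.DOTALL):
        total += 0.5

    # Soft length penalty (-0.1): applied outside the 500-800 token band.
    if num_tokens < 500 or num_tokens > 800:
        total -= 0.1

    return total
```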

**Key Hyperparameters** (a configuration sketch follows the list):
- Group size G=4: generate 4 candidate answers per problem
- KL coefficient of 0.04: keeps the policy from drifting too far from the SFT reference model
- 1,000 GRPO steps, learning rate of 5e-7
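
As one way to realize these settings, here is a hypothetical mapping onto the TRL library's `GRPOTrainer`; the post does not name its training framework, so treat every path and identifier below as an assumption rather than the project's actual code:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def grpo_reward(prompts, completions, **kwargs):
    # TRL reward functions score a batch of completions and return one float each.
    # Toy stand-in: reward the <think> format (the full triplet is sketched above).
    return [0.5 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

# GRPO only needs prompts; rewards come from the verifier, not from labels.
gsm8k_prompts = (
    load_dataset("openai/gsm8k", "main", split="train")
    .rename_column("question", "prompt")
)

config = GRPOConfig(
    output_dir="qwen2.5-7b-grpo",  # hypothetical output path
    num_generations=4,             # group size G=4
    beta=0.04,                     # KL coefficient against the reference policy
    learning_rate=5e-7,
    max_steps=1_000,
)

trainer = GRPOTrainer(
    model="path/to/sft-checkpoint",  # hypothetical: the Phase 1 SFT model
    reward_funcs=[grpo_reward],
    args=config,
    train_dataset=gsm8k_prompts,
)
trainer.train()
```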

---

## Benchmark Results and Analysis

All three model stages were evaluated with lm-evaluation-harness under identical settings:

| Benchmark | Instruct Baseline | SFT Checkpoint | GRPO Final |
|---------|-------------|----------|---------|
| GSM8K 8-shot | 82.64% | 75.51% | 75.66% |
| MATH 500 4-shot | 20.60% | 24.20% | 24.20% |
| ARC-Challenge 25-shot | 67.06% | 62.97% | 62.80% |
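
A sketch of reproducing one cell of this table with the harness's Python API (the checkpoint path and batch size are assumptions; the task name follows the harness's standard registry):

```python
import lm_eval

# Evaluate a checkpoint on GSM8K with 8-shot prompting, mirroring the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/checkpoint,dtype=bfloat16",  # hypothetical path
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```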

## Key Findings

**1. GSM8K Score Drop is an Evaluation Artifact**

SFT changed the model's output format: it now generates a `<think>` reasoning chain before giving the answer, while the GSM8K parser in lm-evaluation-harness is calibrated for the original Instruct model's direct-answer style. The drop therefore does not indicate a regression in reasoning ability.
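
A small illustration of the mismatch (the extraction regex is a simplified stand-in, not the harness's actual parser):

```python
import re

def gsm8k_style_extract(text: str) -> str | None:
    """Simplified stand-in for a parser calibrated on GSM8K's '#### <answer>' style."""
    m = re.search(r"####\s*(-?\d[\d,]*)", text)
    return m.group(1).replace(",", "") if m else None

# The Instruct model, prompted with 8-shot GSM8K exemplars, copies their format.
instruct_output = "Natalia sold 48 + 24 = 72 clips.\n#### 72"
# The SFT model emits the format it was trained on instead.
sft_output = "<think>48 in April, 48/2 = 24 in May, 48 + 24 = 72.</think>\nThe answer is 72."

print(gsm8k_style_extract(instruct_output))  # '72'  -> scored correct
print(gsm8k_style_extract(sft_output))       # None  -> scored wrong despite a correct answer
```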

**2. MATH Benchmark +3.6% is a Real Ability Improvement**

The model was never trained on MATH problems (training data only includes GSM8K and NuminaMath), but the increase from 20.60% to 24.20% indicates that SFT successfully installed a **generalizable reasoning format** rather than simple pattern matching.

**3. Reason for Limited GRPO Improvement: Reward Saturation**

The project authors discovered an important technical phenomenon: since the SFT cold start was very successful (most GSM8K rollouts were correct), the 4 rollouts in a group often received the same reward, leading to an advantage signal close to zero.

Measured data shows that `frac_reward_zero_std` averages 0.63, meaning 63% of batches produced near-zero gradient signals. This is the problem that the **curriculum filtering** mentioned in the DeepSeek-R1 paper aims to solve: select medium-difficulty problems where only 1-2 of the model's rollouts are correct, rather than easy problems where 80% of rollouts succeed (a filtering sketch follows).
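
A minimal sketch of this kind of difficulty filtering, assuming per-problem pass rates estimated from a fixed number of rollouts (the thresholds are illustrative, not from the post):

```python
def filter_by_difficulty(problems, pass_rates, low=0.25, high=0.75):
    """Keep problems whose estimated pass rate falls in a medium band.

    problems:   list of training prompts
    pass_rates: fraction of G rollouts that were correct per prompt
                (e.g. 2 of 4 correct -> 0.5)
    Groups where all rollouts agree (pass rate 0.0 or 1.0) have zero reward
    std, hence zero advantage, and contribute no gradient signal.
    """
    return [p for p, r in zip(problems, pass_rates) if low <= r <= high]

problems = ["easy problem", "medium problem", "hard problem"]
pass_rates = [1.0, 0.5, 0.0]  # from G=4 rollouts: 4/4, 2/4, 0/4 correct
print(filter_by_difficulty(problems, pass_rates))  # ['medium problem']
```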

---

## Why Use Full-Parameter Fine-tuning Instead of LoRA for SFT?

LoRA updates only a small number of parameters in low-rank adapters, which suits incremental learning. The goal of the cold start, however, is to **install a brand-new behavioral prior** (the structured CoT format), and full-parameter fine-tuning gives the model far more capacity for that distribution shift. The 99GB of VRAM on the H100 NVL is sufficient for full-parameter training of a 7B model.

## Why Only Use GSM8K for GRPO?

GRPO requires **verifiable reward signals**: answers must be programmatically checkable. GSM8K answers are clean numerical values, while NuminaMath competition problems have more complex answer formats (symbolic expressions, intervals, and so on), which would raise the error rate of the reward function.
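
To make the contrast concrete, here is a sketch of why GSM8K is easy to verify; GSM8K's gold answers end with a `#### <number>` marker, while the comparison logic below is an assumption about what a checker might do:

```python
import re

def gsm8k_gold(answer_field: str) -> float:
    """GSM8K gold answers end with '#### <number>', e.g. '... #### 72'."""
    return float(answer_field.split("####")[-1].strip().replace(",", ""))

def is_correct(predicted: str, answer_field: str, tol: float = 1e-6) -> bool:
    """Numeric check: trivially programmable for GSM8K."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", predicted.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - gsm8k_gold(answer_field)) < tol

print(is_correct("The answer is 72.", "Natalia sold 48 + 24 = 72 clips.\n#### 72"))  # True

# A competition answer like '\\frac{\\sqrt{5}-1}{2}' cannot be checked this way;
# it would need symbolic equivalence testing (e.g. with sympy), which is where
# reward-function errors creep in.
```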
