Open-source Implementation of Training 7B Language Models for Mathematical Reasoning Using GRPO

This project fully reproduces the reasoning training process from the DeepSeek-R1 paper. Through two-stage training (SFT cold start + GRPO reinforcement learning), it enables the Qwen2.5-7B model to learn step-by-step reasoning to solve mathematical problems, achieving verifiable reward signal optimization without manual preference labels.

Tags: GRPO · DeepSeek-R1 · Qwen2.5 · Mathematical Reasoning · Reinforcement Learning · Large Language Models · Open-Source Reproduction · Cold Start · Reward Saturation
Published 2026-05-13 15:58 · Recent activity 2026-05-13 16:31 · Estimated read: 8 min

Section 01

Introduction / Main Post: Open-source Implementation of Training 7B Language Models for Mathematical Reasoning Using GRPO

This project fully reproduces the reasoning training process from the DeepSeek-R1 paper. Through two-stage training (SFT cold start + GRPO reinforcement learning), it enables the Qwen2.5-7B model to learn step-by-step reasoning to solve mathematical problems, achieving verifiable reward signal optimization without manual preference labels.


Section 02

Project Overview

This project is an open-source implementation that fully reproduces the reasoning training process from the DeepSeek-R1 paper. Its goal is to teach a 7B-parameter language model to solve mathematical problems through step-by-step reasoning using Group Relative Policy Optimization (GRPO). Unlike PPO-based RLHF, which typically relies on a reward model trained from manually labeled preference data, GRPO drops the separate critic (value) network and instead baselines each rollout against the other rollouts in its group, significantly reducing computational overhead.
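To make the intra-group baseline concrete, here is a minimal sketch (not the project's actual training code) of how group-relative advantages can be computed: each rollout's reward is normalized against the mean and standard deviation of its own group, so no learned value network is needed.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO-style advantages from per-rollout rewards.

    rewards: shape (num_prompts, group_size) -- one row per prompt, one column per
    sampled completion in that prompt's group. Each reward is baselined against its
    own group's mean and scaled by the group's std, replacing a learned critic.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, group size 4: two correct+formatted rollouts (1.5), one format-only (0.5), one zero.
rewards = torch.tensor([[1.5, 0.5, 1.5, 0.0]])
print(group_relative_advantages(rewards))
```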

This implementation is based on the Qwen2.5-7B-Instruct model and completes training on a single NVIDIA H100 NVL (99.9GB VRAM), providing a reproducible reasoning enhancement solution for small and medium-sized teams.



Section 03

Phase 1: SFT Cold Start (Supervised Fine-tuning)

Objective: Before starting reinforcement learning, let the model learn the output format for 'thinking'.

Training data includes approximately 27,000 examples:

  • GSM8K training set (7,473 entries): grade-school math word problems
  • NuminaMath-CoT sampling (20,000 entries): Competition-level math problems and their chain-of-thought solutions

Key Training Configuration:

  • Full-parameter fine-tuning (without LoRA) to ensure the model has sufficient capacity to learn new behaviors
  • 2 epochs, effective batch size of 32
  • Learning rate of 2e-5 with cosine decay
  • Key Technique: Loss masking for prompt tokens, so gradients only flow through the reasoning completion part
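A minimal sketch of the prompt-masking idea, assuming a Hugging Face-style causal LM where label positions set to -100 are ignored by the cross-entropy loss (the project's actual data collator may differ):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by torch's CrossEntropyLoss

def build_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the prompt so gradients only flow
    through the completion (the <think> reasoning and the final answer)."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Example: 3 prompt tokens followed by 4 completion tokens (ids are arbitrary).
print(build_labels([11, 22, 33, 44, 55, 66, 77], prompt_len=3))
# -> [-100, -100, -100, 44, 55, 66, 77]
```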

Training Results: Training loss of 0.3357, token accuracy of 92.5%, taking approximately 2 hours.


Section 04

Phase 2: GRPO Reinforcement Learning

Core Innovation: GRPO does not require an independent critic network; instead, it uses intra-group relative rewards as the baseline.

Reward Function Design (verifiable triplet):

| Reward Dimension | Weight | Judgment Logic |
|---|---|---|
| Correctness | 1.0 | The parsed final answer matches the reference answer |
| Format | 0.5 | The response contains a valid <think>...</think> tag structure |
| Length penalty | -0.1 (soft) | Applied when the response exceeds the 500-800 token range |
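A hedged sketch of how this verifiable reward triplet could be implemented. The weights come from the table above; the parsing rules, the interpretation of the length rule (penalize completions outside the 500-800 token window), and the function name are assumptions rather than the project's actual code.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Verifiable reward = correctness (1.0) + format (0.5) + soft length penalty (-0.1)."""
    total = 0.0

    # Format reward: a well-formed <think>...</think> block must be present.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        total += 0.5

    # Correctness reward: compare the last number in the response to the gold answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    try:
        if numbers and abs(float(numbers[-1]) - float(gold_answer)) < 1e-6:
            total += 1.0
    except ValueError:
        pass

    # Soft length penalty (assumed rule): completions outside the 500-800 token
    # window lose 0.1; whitespace tokens approximate the tokenizer count here.
    if not (500 <= len(response.split()) <= 800):
        total -= 0.1

    return total

print(rule_based_reward("<think>4 + 3 = 7</think>\nThe answer is 7", "7"))  # 1.4 (correct + format - length)
```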

Key Hyperparameters:

  • Group size G=4: Generate 4 candidate answers per problem
  • KL coefficient of 0.04: keeps the policy from drifting too far from the SFT reference policy
  • 1,000 GRPO steps, learning rate of 5e-7
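If the run is driven through TRL's GRPOTrainer (a plausible setup for this kind of reproduction, though the post does not say which trainer the project uses), the hyperparameters above map roughly onto the following sketch. Parameter names may differ across TRL versions, and the checkpoint path and column names are placeholders.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K prompts for online rollouts; the gold number after "####" is kept for the reward.
gsm8k = load_dataset("openai/gsm8k", "main", split="train").map(
    lambda ex: {"prompt": ex["question"], "gold_answer": ex["answer"].split("####")[-1].strip()}
)

def grpo_reward(completions, gold_answer, **kwargs):
    # Batch adapter: TRL calls reward functions with lists of completions plus
    # dataset columns; this reuses the rule_based_reward sketched in the previous section.
    return [rule_based_reward(c, a) for c, a in zip(completions, gold_answer)]

config = GRPOConfig(
    output_dir="qwen2.5-7b-grpo",
    num_generations=4,    # group size G=4 candidate answers per problem
    beta=0.04,            # KL coefficient toward the SFT reference policy
    learning_rate=5e-7,
    max_steps=1000,       # 1,000 GRPO steps
)

trainer = GRPOTrainer(
    model="path/to/sft-cold-start-checkpoint",  # placeholder, not the project's path
    args=config,
    reward_funcs=grpo_reward,
    train_dataset=gsm8k,
)
trainer.train()
```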


Section 05

Benchmark Results and Analysis

All three model stages were evaluated with lm-evaluation-harness under identical settings:

| Benchmark | Instruct Baseline | SFT Checkpoint | GRPO Final |
|---|---|---|---|
| GSM8K (8-shot) | 82.64% | 75.51% | 75.66% |
| MATH 500 (4-shot) | 20.60% | 24.20% | 24.20% |
| ARC-Challenge (25-shot) | 67.06% | 62.97% | 62.80% |
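For reference, a hedged sketch of driving one of these evaluations through lm-evaluation-harness's Python API (the post does not say whether the CLI or the API was used; the checkpoint path is a placeholder, and the task name and few-shot count follow the table above for GSM8K):

```python
import lm_eval

# 8-shot GSM8K on one checkpoint; repeat per checkpoint and benchmark with the
# few-shot counts listed in the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/checkpoint,dtype=bfloat16",  # placeholder path
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```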

Section 06

Key Findings

1. GSM8K Score Drop is an Evaluation Artifact

SFT changed the model's output format: the model now generates a <think> reasoning chain before giving its answer, while the GSM8K parser in lm-evaluation-harness is calibrated to the original Instruct model's direct-answer style. The drop therefore reflects an answer-extraction mismatch, not a regression in reasoning ability.
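A purely illustrative example of such a mismatch. The pattern below is a hypothetical strict answer filter, not the harness's actual regex; it shows how an extractor calibrated to the Instruct model's direct-answer style can score a correct <think>-formatted response as wrong.

```python
import re

# Hypothetical strict filter expecting the Instruct model's "The answer is N" style.
STRICT_ANSWER = re.compile(r"The answer is (-?\d+)")

instruct_style = "3 apples plus 4 apples. The answer is 7"
think_style = "<think>3 apples plus 4 apples makes 7 apples.</think>\nFinal answer: 7"

print(STRICT_ANSWER.search(instruct_style))  # matches -> scored correct
print(STRICT_ANSWER.search(think_style))     # None    -> scored wrong despite the correct answer
```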

2. MATH Benchmark +3.6% is a Real Ability Improvement

The model was never trained on MATH problems (training data only includes GSM8K and NuminaMath), but the increase from 20.60% to 24.20% indicates that SFT successfully installed a generalizable reasoning format rather than simple pattern matching.

3. Reason for Limited GRPO Improvement: Reward Saturation

The project authors discovered an important technical phenomenon: since the SFT cold start was very successful (most GSM8K rollouts were correct), the 4 rollouts in a group often received the same reward, leading to an advantage signal close to zero.

Measurements show frac_reward_zero_std averaging 0.63, i.e., roughly 63% of the time the rewards within a group had zero standard deviation and therefore yielded a near-zero gradient signal. This is the problem that the curriculum filtering mentioned in the DeepSeek-R1 paper aims to solve: select medium-difficulty problems where only 1-2 of the rollouts are correct, rather than easy problems where 80% of rollouts are correct.
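A minimal sketch (illustrative names, not the project's logging code) of how this saturation statistic can be computed, together with the curriculum-style filter described above that keeps only prompts on which the group disagrees:

```python
import torch

def frac_reward_zero_std(rewards: torch.Tensor, eps: float = 1e-6) -> float:
    """Fraction of prompt groups whose rewards have (near-)zero standard deviation.

    rewards: shape (num_prompts, group_size). When every rollout in a group gets the
    same reward, the group-relative advantage collapses to ~0 and that prompt
    contributes almost no gradient signal.
    """
    return (rewards.std(dim=1) < eps).float().mean().item()

def keep_medium_difficulty(rewards: torch.Tensor, correct_threshold: float = 1.0) -> torch.Tensor:
    """Curriculum-style mask: keep prompts where some, but not all, rollouts are correct."""
    num_correct = (rewards >= correct_threshold).sum(dim=1)
    return (num_correct > 0) & (num_correct < rewards.shape[1])

rewards = torch.tensor([
    [1.5, 1.5, 1.5, 1.5],   # saturated group: all 4 rollouts correct, zero advantage
    [1.5, 0.5, 0.0, 1.5],   # informative group: mixed outcomes
])
print(frac_reward_zero_std(rewards))    # 0.5
print(keep_medium_difficulty(rewards))  # tensor([False,  True])
```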



Section 07

Why Use Full-Parameter Fine-tuning Instead of LoRA for SFT?

LoRA updates only a small number of parameters in low-rank adapters, which suits incremental learning. The goal of the cold start, however, is to install a brand-new behavioral prior (the structured CoT format), and full-parameter fine-tuning gives the model more capacity for that distribution shift. The H100's ~99 GB of VRAM is sufficient for full-parameter training of a 7B model.


Section 08

Why Only Use GSM8K for GRPO?

GRPO requires verifiable reward signals—answers must be programmatically checkable. GSM8K's answers are clean numerical values, while NuminaMath competition problems have more complex answer formats, which would increase the error rate of the reward function.
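As an illustration of why GSM8K rewards are easy to verify: every GSM8K gold answer ends with a line of the form "#### <number>", so the checker only has to compare two numbers (a minimal sketch; helper names are illustrative).

```python
import re

def gsm8k_gold_number(answer_field: str) -> float:
    """GSM8K gold answers end with a line like '#### 72'."""
    return float(answer_field.split("####")[-1].strip().replace(",", ""))

def is_correct(model_answer: str, answer_field: str, tol: float = 1e-6) -> bool:
    """Programmatic check: compare the last number in the model's answer to the gold number."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_answer.replace(",", ""))
    return bool(numbers) and abs(float(numbers[-1]) - gsm8k_gold_number(answer_field)) < tol

gold = "Natalia sold 48/2 = 24 clips in May, so 48 + 24 = 72 in total.\n#### 72"
print(is_correct("<think>48 + 24 = 72</think>\nThe final answer is 72.", gold))  # True
```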