Fine-tuning Qwen2.5-3B with GRPO Reinforcement Learning: Enabling Small Language Models to Master Mathematical Reasoning

This article introduces an open-source project that uses the GRPO (Group Relative Policy Optimization) algorithm to fine-tune the Qwen2.5-3B-Instruct model via reinforcement learning. The project focuses on training the model to solve structured mathematical puzzles, and through specific reasoning format constraints, enables small models to exhibit strong mathematical reasoning and symbolic computation capabilities.

Tags: GRPO · Reinforcement Learning · Qwen2.5 · Mathematical Reasoning · Small Language Models · PPO · Model Fine-tuning · Structured Output
Published 2026-05-05 15:45 · Recent activity 2026-05-05 15:52 · Estimated read 7 min

Section 01

Introduction: Enabling Qwen2.5-3B to Master Mathematical Reasoning with GRPO Reinforcement Learning

This article introduces an open-source project: fine-tuning the Qwen2.5-3B-Instruct model with the GRPO (Group Relative Policy Optimization) reinforcement learning algorithm so that it exhibits strong reasoning on structured mathematical puzzle tasks. Rather than relying on scale, the project shows that a small model can rival much larger ones on a specific task; the core idea is to improve performance through structured output constraints and an efficient training method.

Section 02

Project Background and Core Tasks

Objective Task

The project focuses on number-combination puzzles: given a set of numbers and the arithmetic operators, construct an expression that uses each number exactly once and evaluates to a target value. The challenges include combinatorial explosion, exact arithmetic, symbolic reasoning, and constraint satisfaction.
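
To make the task concrete, here is a minimal sketch (not from the project) of a verifier for such a puzzle: it checks that a candidate expression uses each given number exactly once and evaluates to the target.

```python
# Minimal verifier sketch for the number-combination puzzle.
import ast

def verify_expression(expr: str, numbers: list[int], target: int) -> bool:
    """Check that expr uses each number exactly once and equals target."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    # Collect every numeric literal appearing in the expression.
    used = sorted(
        node.value for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
    )
    if used != sorted(numbers):
        return False  # a number is missing, repeated, or extraneous
    try:
        value = eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}})
    except (ZeroDivisionError, OverflowError):
        return False
    return abs(value - target) < 1e-6  # float tolerance for division

print(verify_expression("(6 - 2) * (7 - 1)", [1, 2, 6, 7], 24))  # True
```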

Output Specifications

The model is required to generate structured responses: the thinking process goes inside a <reasoning> tag, and a verifiable expression goes inside an <answer> tag, which simplifies verification and reward-function design.
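
A hedged sketch of how such a response could be parsed; the exact tag layout is an assumption based on the description above.

```python
# Extract the two tagged sections from a model response.
import re

REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_response(text: str) -> tuple[str | None, str | None]:
    """Return (reasoning, answer), with None for any missing section."""
    reasoning = REASONING_RE.search(text)
    answer = ANSWER_RE.search(text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        answer.group(1).strip() if answer else None,
    )
```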

Section 03

GRPO Algorithm and Base Model Selection

GRPO Algorithm Improvements

GRPO evolved from PPO; its core idea is to estimate the advantage from relative performance within a group: sample multiple candidate answers for the same input and compute each answer's advantage relative to the group's average reward. This eliminates the overhead of a separate value network, improving efficiency and stability.
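
In code, the group-relative advantage can be sketched as follows; this is a minimal illustration of the idea, not the project's implementation.

```python
# Group-relative advantage estimate used by GRPO in place of a value network:
# each sampled answer's advantage is its reward normalized against the mean
# (and standard deviation) of the group sampled for the same prompt.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards for sampled completions."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0]])  # four samples for one prompt
print(group_relative_advantages(rewards))
```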

Base Model Selection

The project chooses Qwen2.5-3B-Instruct: moderate scale (3 billion parameters), strong instruction following, multilingual support, and open-source friendly. The base model already has basic mathematical ability but needs reinforcement-learning optimization.
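
The base model loads with the standard Hugging Face transformers API; the checkpoint name is the official release, while the loading options below are ordinary defaults rather than the project's exact setup.

```python
# Load the Qwen2.5-3B-Instruct base model and tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # place layers across available devices
)
```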

Section 04

Training Process and Technical Details

Reward Function Design

The reward is evaluated along multiple dimensions: a correctness reward (the core signal), a format reward (tag compliance), a process reward (reasoning logic), and an efficiency reward (concise expressions).
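
A hedged sketch of how these four dimensions might be combined into one scalar reward; the weights are illustrative assumptions, and it reuses the parse_response and verify_expression sketches from earlier sections.

```python
# Illustrative multi-dimensional reward (weights are assumptions).
def compute_reward(response: str, numbers: list[int], target: int) -> float:
    reasoning, answer = parse_response(response)
    reward = 0.0
    if reasoning is not None and answer is not None:
        reward += 0.5                            # format reward: both tags present
    if answer is not None and verify_expression(answer, numbers, target):
        reward += 2.0                            # correctness reward (core signal)
        reward += 0.5 / max(len(answer), 1)      # efficiency: shorter expressions
    if reasoning and len(reasoning.split()) >= 10:
        reward += 0.25                           # crude proxy for a process reward
    return reward
```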

Training Data Construction

Training data is procedurally generated: randomly draw numbers and a target value, verify solvability with a solver, filter out extremely difficult instances, and cover a range of difficulty levels.
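
A sketch of such a generation pipeline, assuming a brute-force solver; solve and generate_example are hypothetical helpers, not the project's code.

```python
# Procedural puzzle generation with solvability filtering.
import itertools
import random

def solve(numbers: list[int], target: int) -> str | None:
    """Brute-force search over orderings and operator choices. Only left-nested
    parenthesization is tried here; a real solver would also permute trees."""
    for perm in itertools.permutations(numbers):
        for ops in itertools.product("+-*/", repeat=len(numbers) - 1):
            expr = str(perm[0])
            for op, n in zip(ops, perm[1:]):
                expr = f"({expr} {op} {n})"
            try:
                if abs(eval(expr) - target) < 1e-6:
                    return expr
            except ZeroDivisionError:
                continue
    return None

def generate_example() -> dict | None:
    numbers = random.sample(range(1, 14), k=4)
    target = random.randint(10, 100)
    solution = solve(numbers, target)  # filter out unsolvable instances
    if solution is None:
        return None
    return {"numbers": numbers, "target": target, "solution": solution}
```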

Hyperparameter Tuning

Key parameters: group size, learning rate, KL-divergence constraint, and reward scaling. Configurations suited to the 3B model were determined through experiments.
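
As a sketch of what such a configuration might look like, here it is expressed as a GRPOConfig from the TRL library; this is an assumption, since the article does not say which framework the project uses, and the values are illustrative rather than the project's actual settings.

```python
# Illustrative GRPO hyperparameter configuration via TRL.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-puzzles",
    learning_rate=1e-6,             # small learning rate: RL fine-tuning is sensitive
    num_generations=8,              # group size: candidate answers sampled per prompt
    beta=0.04,                      # strength of the KL-divergence constraint
    max_completion_length=512,      # room for <reasoning> plus <answer>
    per_device_train_batch_size=8,  # must divide evenly by the group size
)
```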

Section 05

Experimental Results and Capability Demonstration

Quantitative Evaluation

  • Accuracy: rose from a 40% baseline to over 80% (a measurement sketch follows this list)
  • Format compliance rate: over 95%
  • Generalization: performance holds up on unseen number combinations
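
The two headline metrics could be measured with a harness like the following sketch; generate_response is a hypothetical inference wrapper, and it reuses the parsing and verification sketches from earlier sections.

```python
# Measure accuracy and format compliance over a held-out test set.
def evaluate(test_set: list[dict], generate_response) -> dict:
    correct = compliant = 0
    for ex in test_set:
        response = generate_response(ex["numbers"], ex["target"])
        reasoning, answer = parse_response(response)
        if reasoning is not None and answer is not None:
            compliant += 1
        if answer is not None and verify_expression(answer, ex["numbers"], ex["target"]):
            correct += 1
    n = len(test_set)
    return {"accuracy": correct / n, "format_compliance": compliant / n}
```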

Qualitative Analysis

The model demonstrates strategy differentiation (e.g., multiply first, then adjust), self-correction, and step decomposition, with coherent reasoning processes.

Section 06

Technical Significance and Application Prospects

Implications for Small Model Research

The project shows that efficient training can make small models excel in specific domains, suiting edge computing, cost-sensitive scenarios, and rapid iteration.

Educational Applications

Intelligent tutoring (step-by-step analysis), adaptive practice (personalized puzzle generation), and process-oriented assessment (evaluating the thinking process, not just the final answer).

Methodology Transfer

Can be applied to code generation, logical puzzles, symbolic computation, constraint satisfaction problems, etc.

Section 07

Challenges and Limitations

  • Task scope: the project focuses only on number-combination problems and has not been generalized to complex mathematical tasks such as geometric proofs
  • Computational resources: GRPO's group sampling requires multiple forward passes per prompt, so training time still needs optimization
  • Reward hacking: models may exploit loopholes in the reward; the project mitigates this through multi-dimensional reward design and manual spot checks

Section 08

Conclusion: Rebalancing Efficiency and Capability

The project illustrates a trend in AI research: shifting from the pursuit of scale toward the optimization of training methods. Through algorithmic innovation and pipeline design, small models can shine in specific domains, advancing the democratization of AI. For developers it offers a reproducible recipe, embodying the balance of efficiency and capability under resource constraints.