# Fine-tuning Qwen2.5-3B with GRPO Reinforcement Learning: Enabling Small Language Models to Master Mathematical Reasoning

> This article introduces an open-source project that uses the GRPO (Group Relative Policy Optimization) algorithm to fine-tune the Qwen2.5-3B-Instruct model via reinforcement learning. The project focuses on training the model to solve structured mathematical puzzles, and through specific reasoning format constraints, enables small models to exhibit strong mathematical reasoning and symbolic computation capabilities.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-05T07:45:42.000Z
- Last activity: 2026-05-05T07:52:19.970Z
- Popularity: 159.9
- Keywords: GRPO, reinforcement learning, Qwen2.5, mathematical reasoning, small language models, PPO, model fine-tuning, structured output
- Page link: https://www.zingnex.cn/en/forum/thread/grpoqwen2-5-3b
- Canonical: https://www.zingnex.cn/forum/thread/grpoqwen2-5-3b
- Markdown source: floors_fallback

---

## Introduction: Enabling Qwen2.5-3B to Master Mathematical Reasoning with GRPO Reinforcement Learning

This article introduces an open-source project: fine-tuning the Qwen2.5-3B-Instruct model with the GRPO (Group Relative Policy Optimization) reinforcement learning algorithm so that it exhibits strong reasoning on structured mathematical puzzle tasks. Rather than relying on scaling laws and parameter count, the project uses structured output constraints and an efficient training method to let a small model rival much larger ones on a specific task.

## Project Background and Core Tasks

### Objective Task
The project focuses on number-combination puzzles: given a set of numbers and arithmetic operators, construct an expression that uses each number exactly once and evaluates to a target value. The challenges include the combinatorial explosion of the search space, exact arithmetic, symbolic reasoning, and constraint satisfaction.
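
A brute-force solver makes the task (and its combinatorial explosion) concrete; a solvability check of this kind is also what the data pipeline below relies on. This is a minimal sketch under two assumptions not stated in the article: operators are composed left to right (made explicit with parentheses), and division must be exact.

```python
from itertools import permutations, product

def solve(numbers, target, ops="+-*/"):
    """Brute-force search over every ordering of the numbers and every
    operator combination, evaluating left to right. Returns one valid
    expression string, or None if the puzzle is unsolvable."""
    for nums in permutations(numbers):
        for chosen in product(ops, repeat=len(nums) - 1):
            value, expr, ok = nums[0], str(nums[0]), True
            for op, n in zip(chosen, nums[1:]):
                if op == "+":
                    value += n
                elif op == "-":
                    value -= n
                elif op == "*":
                    value *= n
                elif n == 0 or value % n:   # keep division exact
                    ok = False
                    break
                else:
                    value //= n
                # parenthesize so the string preserves left-to-right order
                expr = f"({expr} {op} {n})"
            if ok and value == target:
                return expr
    return None

print(solve([3, 5, 7, 2], 24))
```

Even at four numbers the search space is `4! × 4³ = 1536` candidate expressions, which is why the model must search symbolically rather than enumerate.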

### Output Specifications
The model is required to generate structured responses: the thinking process goes inside a `<reasoning>` tag, and a verifiable expression goes inside an `<answer>` tag. This structure makes automatic verification and reward-function design straightforward.
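
A checker for this format can be a few lines of regex plus expression verification. A sketch (the article does not publish the project's checker, so the exact validation rules here are assumptions):

```python
import re

# Non-greedy match of the two required tags, in order.
PATTERN = re.compile(
    r"<reasoning>(.*?)</reasoning>\s*<answer>(.*?)</answer>", re.DOTALL)

def check_response(text, numbers, target):
    """Return (format_ok, answer_correct) for one model response."""
    m = PATTERN.search(text)
    if not m:
        return False, False
    expr = m.group(2).strip()
    # Each given number must appear exactly once in the expression.
    if sorted(int(t) for t in re.findall(r"\d+", expr)) != sorted(numbers):
        return True, False
    try:
        # eval with no builtins; a production checker should instead
        # whitelist AST node types before evaluating.
        value = eval(expr, {"__builtins__": {}})
    except Exception:
        return True, False
    return True, value == target
```

The two booleans map directly onto the format reward and correctness reward described later.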

## GRPO Algorithm and Base Model Selection

### GRPO Algorithm Improvements
GRPO evolved from PPO; its core idea is to estimate the advantage function from relative performance within a group. For each input, several candidate answers are sampled, and each answer's advantage is computed relative to the group's mean reward. This eliminates the value (critic) network entirely, improving training efficiency and stability.
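
The group-relative advantage described above is a per-group z-score; a minimal sketch (normalizing by the group's standard deviation as in the original GRPO formulation, which the article does not spell out):

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's critic-free advantage estimate: z-score each sampled
    answer's reward against its own group's mean and population std.
    rewards: per-answer rewards for one prompt's group of samples."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of 4 samples, two correct (reward 1) and two wrong (reward 0):
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [1, -1, -1, 1]
```

Because the baseline is the group's own mean, no separate value network has to be trained, which is where the efficiency gain over PPO comes from.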

### Base Model Selection
The project uses Qwen2.5-3B-Instruct: moderate scale (3 billion parameters), strong instruction following, multilingual support, and open-source friendly. The base model already has basic mathematical ability; reinforcement learning is used to sharpen it for this task.

## Training Process and Technical Details

### Reward Function Design
The reward combines several dimensions: a correctness reward (the core signal), a format reward (tag compliance), a process reward (reasoning quality), and an efficiency reward (concise expressions).
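
A weighted combination of these dimensions can be sketched as follows. The weights are illustrative, not the project's published values, and the process reward is omitted because scoring reasoning quality requires a separate judge model:

```python
def total_reward(format_ok, answer_correct, n_reasoning_tokens,
                 w_correct=1.0, w_format=0.2, w_brevity=0.1, max_len=256):
    """Combine reward dimensions for one response.
    All weights are assumptions chosen for illustration."""
    r = w_format if format_ok else 0.0           # format reward: tag compliance
    if answer_correct:
        r += w_correct                           # correctness reward (core)
        # efficiency reward: small bonus that decays with reasoning length,
        # granted only for correct answers so brevity is never traded
        # against correctness
        r += w_brevity * max(0.0, 1.0 - n_reasoning_tokens / max_len)
    return r
```

Keeping correctness dominant (here 1.0 vs. 0.2 and 0.1) matters: if the format or brevity terms are too large, the policy can learn to satisfy them while ignoring the actual puzzle, which is exactly the reward-hacking risk discussed under limitations.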

### Training Data Construction
Procedurally generated: randomly generate numbers and target values, verify solvability with a solver, filter out extremely difficult problems, and cover a range of difficulty levels.
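
One way to get guaranteed-solvable problems without running a solver on every candidate is generation by construction: sample the numbers, fold random operators over them, and take the result as the target. This is a sketch of that variant, not necessarily the project's pipeline; the operator set, number range, and target filter are all assumptions:

```python
import random

def make_puzzle(rng, k=4, lo=1, hi=9, ops="+-*", target_range=(1, 100)):
    """Generate one solvable puzzle by construction. Division is
    skipped so targets stay integral; the target-range filter drops
    degenerate (tiny) and extreme (huge) problems."""
    while True:
        nums = [rng.randint(lo, hi) for _ in range(k)]
        value = nums[0]
        for n in nums[1:]:
            op = rng.choice(ops)
            value = (value + n if op == "+" else
                     value - n if op == "-" else
                     value * n)
        if target_range[0] <= value <= target_range[1]:
            return {"numbers": nums, "target": value}

rng = random.Random(0)  # seeded for reproducible datasets
dataset = [make_puzzle(rng) for _ in range(1000)]
```

Varying `k`, the number range, and the target range gives the difficulty tiers mentioned above.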

### Hyperparameter Tuning
Key parameters include group size, learning rate, the KL-divergence constraint, and reward scaling; configurations suited to the 3B model were determined experimentally.
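
The article does not publish the final values, so the following is only a plausible starting point for a 3B model; every number is an assumption to sweep, not the project's configuration:

```python
# Illustrative GRPO hyperparameters for a ~3B instruct model.
# None of these values come from the article; treat them as sweep seeds.
grpo_config = {
    "group_size": 8,               # answers sampled per prompt; larger groups
                                   # give steadier advantages at more compute
    "learning_rate": 1e-6,         # small: RL on an instruct model drifts fast
    "kl_coeff": 0.04,              # strength of the KL pull toward the
                                   # frozen reference policy
    "reward_scale": 1.0,           # rescale raw rewards before normalization
    "max_completion_tokens": 512,  # bound on <reasoning> + <answer> length
}
```

The group size is the main cost lever: each prompt requires `group_size` forward generations, which is the multiple-forward-pass overhead noted under limitations.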

## Experimental Results and Capability Demonstration

### Quantitative Evaluation
- Accuracy: increased from baseline 40% to over 80%
- Format compliance rate: over 95%
- Generalization ability: maintains good performance on unseen number combinations

### Qualitative Analysis
The model demonstrates strategy differentiation (e.g., multiply first then adjust), self-correction, step decomposition, etc., with coherent reasoning processes.

## Technical Significance and Application Prospects

### Implications for Small Model Research
The project shows that efficient training can make small models excel in specific domains, making them suitable for edge computing, cost-sensitive scenarios, and rapid iteration.

### Educational Applications
Intelligent tutoring (step analysis), adaptive practice (personalized puzzles), process evaluation (thinking process).

### Methodology Transfer
Can be applied to code generation, logical puzzles, symbolic computation, constraint satisfaction problems, etc.

## Challenges and Limitations

- Task scope: focuses only on number-combination problems; not yet generalized to complex mathematical tasks such as geometric proofs
- Computational resources: GRPO's group sampling requires multiple forward passes per prompt, so training time still needs optimization
- Reward hacking: models may exploit loopholes in the reward; mitigated through multi-dimensional reward design and manual spot checks

## Conclusion: Rebalancing Efficiency and Capability

The project illustrates a broader trend in AI research: a shift from pursuing scale toward optimizing training methods. Through algorithmic innovation and careful process design, small models can excel in specific domains, advancing the democratization of AI. For developers, it offers a reproducible recipe and a working demonstration of balancing efficiency and capability under resource constraints.
