Section 01
Introduction / Original Post: GRPO Reasoning Fine-Tuning: Enhancing the Mathematical Reasoning of Small Models via Group Relative Policy Optimization
This project fine-tunes the small SmolLM2-135M model with GRPO (Group Relative Policy Optimization) on the GSM8K math dataset. A multi-objective reward system scores each sampled answer on two things at once: whether the final result is correct and whether the output follows the required structured format.
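To make the idea concrete, here is a minimal sketch of a two-part reward and the group-relative advantage that GRPO computes from it. It is an illustration under stated assumptions, not the project's actual code: the `<think>`/`<answer>` tag template, the weights `w_acc` and `w_fmt`, and all function names are hypothetical. The only dataset detail relied on is that GSM8K reference solutions end with `#### <final answer>`.

```python
import re
import statistics

# Hypothetical output template: the source does not say which tags are used.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def extract_gold(gsm8k_solution: str) -> str:
    """Take the final answer that follows '####' in a GSM8K reference solution."""
    return gsm8k_solution.split("####")[-1].strip().replace(",", "")


def reward(completion: str, gold: str, w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    """Weighted sum of an accuracy term and a format term (weights are assumed)."""
    match = FORMAT_RE.search(completion)
    fmt = 1.0 if match else 0.0
    acc = 0.0
    if match:
        # Only score accuracy when the answer can be read from the template.
        pred = match.group(1).strip().replace(",", "")
        acc = 1.0 if pred == gold else 0.0
    return w_acc * acc + w_fmt * fmt


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    gold = extract_gold("She sells 16 - 3 - 4 = 9 eggs ... #### 18")
    group = [
        "<think>9 * 2 = 18</think><answer>18</answer>",  # correct and well formed
        "<think>9 + 2 = 11</think><answer>11</answer>",  # well formed, wrong answer
        "The answer is 18",                              # right number, no template
    ]
    rewards = [reward(c, gold) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because each advantage is measured relative to the other completions sampled for the same prompt, no separate value model is needed. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal for the step.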