Zing Forum


Math Reasoning Arena: End-to-End Training Practice for Lightweight Math Reasoning Models

A complete two-stage alignment project that transforms a 0.5B-parameter base model into a professional math reasoning assistant using SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) techniques, supporting CPU training and featuring an interactive web interface.

Tags: LLM, Math Reasoning, DPO, SFT, Model Fine-Tuning, Qwen, Lightweight Models, CPU Training
Published 2026-06-08 00:15 | Recent activity 2026-06-08 00:19 | Estimated read 6 min

Section 01

Introduction to Math Reasoning Arena: End-to-End Training Project for Lightweight Math Reasoning Models

Core Points: Math Reasoning Arena takes a 0.5B-parameter base model through two alignment stages, SFT followed by DPO, to produce a math reasoning assistant. Training can run on a CPU, and the trained models are served through an interactive web interface.

This project aims to lower the barrier to training math reasoning models, enabling individual developers and small teams to participate.

Section 02

Project Background and Motivation

Math reasoning remains a weak point of large language models; even models with many parameters often make logical errors. Improving math ability has traditionally required substantial compute, which puts it out of reach for many individual developers.

This project sets out to show that, with a well-designed training process, a lightweight model (0.5B parameters) can also reach useful math reasoning ability. The entire pipeline can run on a CPU, which greatly lowers the barrier to entry.

Section 03

Two-Stage Training Process and Model Selection

Two-Stage Alignment Training Process

  1. Supervised Fine-Tuning (SFT): Trains on 2,000+ math problems with chain-of-thought solutions from the MetaMathQA dataset, so the model learns to recognize problem structure and produce consistently formatted solutions.
  2. Direct Preference Optimization (DPO): Requires no separate reward model. The model is trained on positive/negative sample pairs (correct reasoning vs. incorrect reasoning) so that it learns to prefer, and internalize, correct reasoning patterns.
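To make the DPO step above more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not taken from the project's code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions for illustration. The inputs are the summed log-probabilities that the trainable policy and a frozen reference model (typically the SFT checkpoint) assign to the correct ("chosen") and incorrect ("rejected") solutions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, not project code).

    Each argument is the summed log-probability of a full solution under
    either the trainable policy or the frozen reference model.
    """
    # How much more the policy favours each solution than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins; positive when the policy has
    # shifted toward the chosen solution relative to the reference.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy is identical to the reference, the loss equals log 2; it falls below that as the policy shifts probability toward the correct solution, and rises above it when the policy shifts toward the incorrect one. No reward model appears anywhere in the computation, which is the point made in step 2.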

Model Selection

The project builds on Qwen2.5-0.5B for the following reasons:

  • High parameter efficiency, trainable on consumer-grade hardware
  • Strong base capability, with good benchmark results for its size
  • Open-source friendly, with a permissive license

An adapted GPT-2 version is also provided for comparison.

Section 04

Dataset Construction and Interactive Web Interface

Dataset Construction

  • SFT Dataset: 2,000+ instruction-response pairs drawn from MetaMathQA, each with a detailed chain-of-thought solution, covering a range of problem types.
  • DPO Dataset: Positive/negative sample pairs, where the positive example is a correct solution and the negative example follows a common error pattern.
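The two record shapes described above might look roughly like the sketch below. The prompt template and both helper names are hypothetical, and the `query` / `response` field names are an assumption about how the MetaMathQA records are laid out. The `prompt` / `chosen` / `rejected` layout is the one commonly expected by preference-tuning libraries such as TRL, but the source does not say which library the project uses.

```python
# Hypothetical prompt template; the project's actual template is not given.
PROMPT_TEMPLATE = "### Question:\n{question}\n\n### Solution:\n"


def build_sft_example(record):
    """Turn one MetaMathQA-style record into a prompt/completion pair.

    Assumes the record exposes 'query' and 'response' fields.
    """
    prompt = PROMPT_TEMPLATE.format(question=record["query"].strip())
    return {"prompt": prompt, "completion": record["response"].strip()}


def build_dpo_pair(question, correct_solution, flawed_solution):
    """Pair a correct solution with a flawed one for preference training."""
    chosen = correct_solution.strip()
    rejected = flawed_solution.strip()
    if chosen == rejected:
        # A pair with identical sides carries no preference signal.
        raise ValueError("chosen and rejected solutions must differ")
    return {
        "prompt": PROMPT_TEMPLATE.format(question=question.strip()),
        "chosen": chosen,
        "rejected": rejected,
    }


# Illustrative toy pair (not from the project's data): the rejected
# solution applies the operations in the wrong order.
example_pair = build_dpo_pair(
    "What is 3 + 4 * 2?",
    "4 * 2 = 8, then 3 + 8 = 11. The answer is: 11",
    "3 + 4 = 7, then 7 * 2 = 14. The answer is: 14",
)
```

Using the same prompt template in both stages keeps the DPO stage consistent with what the SFT model has already seen.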

Interactive Web Interface

  • Flask API Backend: RESTful design, so the models can be deployed as a standalone service and scaled independently of the frontend.
  • Streamlit Frontend: Interactive UI that displays the reasoning process in real time and supports parameter adjustment and side-by-side result comparison.

Section 05

Training Results and Evaluation

The project provides detailed evaluation results comparing the performance of the base model, SFT model, and DPO model:

  • Base Model: Basic language understanding but limited math reasoning ability
  • SFT Model: Learns answer formats and generates structured responses
  • DPO Model: Improves answer accuracy and reduces reasoning errors
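One simple way to put numbers on a comparison like the one above is final-answer accuracy: extract the last number from each generated solution and compare it with the reference answer. The sketch below is an assumption about how such a check could be written, not the project's evaluation code; `extract_final_answer`, `accuracy`, and the `generate` callable (any function mapping a question string to generated text) are hypothetical names.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_answer(text):
    """Return the last number that appears in the text, or None.

    Commas are dropped first so that '1,250' is read as 1250.
    """
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(generate, problems, tolerance=1e-6):
    """Fraction of problems whose extracted answer matches the reference.

    `generate` is any callable mapping a question to generated text;
    `problems` is a list of (question, reference_answer) tuples.
    """
    if not problems:
        return 0.0
    correct = 0
    for question, reference in problems:
        predicted = extract_final_answer(generate(question))
        if predicted is not None and abs(predicted - reference) <= tolerance:
            correct += 1
    return correct / len(problems)
```

Running `accuracy` once per checkpoint (base, SFT, DPO) on the same held-out problems gives three directly comparable scores. The last-number heuristic is crude: it only works for numeric answers and can be fooled by solutions that keep writing after stating the result.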

A quick start script (run_app.bat) is provided for new users to quickly experience the trained models.

Section 06

Project Significance and Insights

Practical Significance

  1. Lower Barrier to Entry: A CPU-compatible training process lets more developers take part in fine-tuning
  2. Methodology Demonstration: The two-stage alignment process (SFT + DPO) can be replicated in other domains
  3. Data Importance: For a model this small, high-quality structured data can matter more than a larger parameter count
  4. Open-Source Ecosystem: Built on Qwen and public datasets, so the results can be reproduced

Summary

Math Reasoning Arena is a useful case study in end-to-end training. It covers the full path from data preparation to deployment, which makes it a good starting point for learning large model fine-tuning.