Math-SLM: Efficient Training of a Small Math Reasoning Model in 3.5 Hours

The math-slm project shows how to fine-tune DeepSeek-R1-Distill-Qwen-7B for math reasoning in just 3.5 hours on 8 H100 GPUs. It combines SFT, DPO, and LoRA to keep model training efficient in resource-constrained scenarios.

Math Reasoning, Model Fine-Tuning, LoRA, DPO, SFT, DeepSeek, Efficient Training
Published 2026-05-07 17:07 | Recent activity 2026-05-24 17:19 | Estimated read 5 min

Section 01

[Introduction] Math-SLM: Efficient Training of a Small Math Reasoning Model in 3.5 Hours

This project was published by debtirthasaha on GitHub (https://github.com/debtirthasaha/math-slm). It shows how to fine-tune DeepSeek-R1-Distill-Qwen-7B for math reasoning in just 3.5 hours on 8 H100 GPUs. The core strategy combines SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and LoRA (Low-Rank Adaptation), which keeps training efficient in resource-constrained scenarios.

Section 02

Project Background and Motivation

Math reasoning remains one of the core capability challenges for large language models. Closed-source models like GPT-4 and Claude perform well, but the open-source community still needs efficient ways to train small models under limited resources. DeepSeek-R1-Distill-Qwen-7B already has strong reasoning capabilities, yet traditional full-parameter fine-tuning is costly and has strict hardware requirements. This project aims to significantly reduce training costs while maintaining performance by combining SFT, DPO, and LoRA.

Section 03

Analysis of Core Technical Solutions

  1. LoRA: Applied to the projection matrices of the attention layers. The original model weights stay frozen and only a small number of low-rank matrices are trained, which significantly reduces the trainable parameter count and computational overhead.
  2. SFT: Uses high-quality math instruction datasets so the model learns standard problem-solving steps and logical chains.
  3. DPO: Learns directly from preference data, without training a separate reward model, which simplifies the pipeline and improves output accuracy and readability.
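
The LoRA and DPO steps above can be made concrete with a little arithmetic. The following is a minimal sketch in plain Python, not the project's training code: the 4096 x 4096 matrix size, the rank of 16, `beta=0.1`, and the log-probability values are all illustrative assumptions, and the function names are hypothetical. It counts trainable parameters for one projection matrix with and without LoRA, and evaluates the standard DPO loss for a single preference pair.

```python
import math


def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates the whole d_out x d_in matrix W. LoRA freezes W
    and trains B (d_out x rank) and A (rank x d_in); the update is B @ A.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    where p* come from the model being trained and r* from the frozen reference.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow in exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative sizes only (assumed, not taken from the project).
full, lora = lora_param_counts(4096, 4096, 16)
print(f"full={full:,} lora={lora:,} ratio={lora / full:.4%}")

# Policy identical to the reference: the margin is 0 and the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy assigns the chosen answer a higher log-probability: the loss drops.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

Under these assumed sizes the LoRA update trains 1/128 of the parameters of the full matrix, which is where the reduction in trainable parameters comes from. The DPO loss needs only log-probabilities from the policy and a frozen reference model, so no reward model appears anywhere in the computation.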

Section 04

Training Efficiency Optimization Strategies

  1. Distributed Parallelism: Combines data parallelism (each GPU processes different batches) with model parallelism (which works around single-card memory limits) to fully utilize the 8 H100 GPUs.
  2. Mixed Precision Training: Uses FP16/BF16 to reduce memory usage and computation time, with gradient accumulation to reach the desired effective batch size.
  3. Efficient Data Processing: Optimizes tokenization and batching to minimize I/O waiting and keep the GPUs busy with computation.
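
How data parallelism, model parallelism, and gradient accumulation interact can be shown with a small calculation. The sketch below is plain Python under stated assumptions: the batch sizes, the toy one-parameter model, and all function names are hypothetical and are not taken from the project. It computes the number of samples behind one optimizer step, and checks that accumulating scaled micro-batch gradients reproduces the full-batch gradient.

```python
def effective_batch_size(micro_batch, num_gpus, accumulation_steps, model_parallel_size=1):
    """Samples contributing to one optimizer step.

    GPUs that jointly hold one model replica (model parallelism) process the
    same micro-batch, so only num_gpus // model_parallel_size replicas add
    new samples.
    """
    if num_gpus % model_parallel_size != 0:
        raise ValueError("num_gpus must be divisible by model_parallel_size")
    replicas = num_gpus // model_parallel_size
    return micro_batch * replicas * accumulation_steps


def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accumulation_steps):
    """Accumulate gradients over equal-sized micro-batches before one update."""
    if len(xs) % accumulation_steps != 0:
        raise ValueError("batch must split evenly into micro-batches")
    size = len(xs) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        lo, hi = step * size, (step + 1) * size
        # Scaling by 1 / accumulation_steps makes the sum a full-batch mean.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return total


# Assumed settings: micro-batch of 4 per replica, 8 GPUs, 4 accumulation steps.
print(effective_batch_size(4, 8, 4))                         # pure data parallel
print(effective_batch_size(4, 8, 4, model_parallel_size=2))  # 4 replicas of 2 GPUs

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0]
print(grad_mse(1.5, xs, ys), accumulated_grad(1.5, xs, ys, 2))
```

The last line prints the same gradient twice, which is the point of gradient accumulation: a large effective batch can be reached with micro-batches that fit in memory, at the cost of more forward and backward passes per update.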

Section 05

Model Performance and Evaluation

The trained model has been released on Hugging Face (MR0b0t/math-slm-sft-dpo-v5). No detailed benchmark scores are reported, so the following are expected capabilities rather than measured results:

  • Basic arithmetic and algebra: accurate multi-step calculations.
  • Geometry and probability: converting natural-language problems into expressions and applying the relevant theorems.
  • Complex reasoning: generating interpretable, step-by-step solutions.
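
Since no benchmark scores are reported, readers may want to measure accuracy themselves. The sketch below shows one simple way to score generated solutions by exact match on the final answer. It assumes the model is prompted to wrap its final answer in `\boxed{...}`, which is a common convention for math reasoning models but is not confirmed for this checkpoint; the function names and sample strings are hypothetical, and real evaluations usually also normalize equivalent answers (for example `0.5` versus `\frac{1}{2}`), which this sketch does not do.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward and track brace depth so nested braces are kept intact.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def exact_match_accuracy(outputs, references):
    """Fraction of outputs whose boxed answer equals the reference string."""
    if not outputs or len(outputs) != len(references):
        raise ValueError("outputs and references must be non-empty and aligned")
    hits = sum(
        extract_boxed(out) == ref.strip()
        for out, ref in zip(outputs, references)
    )
    return hits / len(outputs)


# Made-up model outputs for illustration only.
outputs = [
    "6 * 7 = 42, so the answer is \\boxed{42}.",
    "Two of four outcomes qualify: \\boxed{\\frac{1}{2}}",
    "I am not sure about this one.",
]
references = ["42", "\\frac{1}{2}", "7"]
print(exact_match_accuracy(outputs, references))
```

Exact string match is strict by design: it undercounts answers that are mathematically equal but written differently, so any score it produces should be read as a lower bound.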

Section 06

Practical Value and Application Scenarios

  • Researchers/Developers: A reproducible and efficient training template that can be extended to larger models or other reasoning domains.
  • Resource-Constrained Teams: Evidence that competitive specialized models can be trained with modest resources.
  • EdTech: Suitable for intelligent tutoring, automatic grading, and personalized recommendations, where the fine-tuned model is expected to give more accurate and consistent answers.

Section 07

Limitations and Future Directions

Limitations: domain specificity (math reasoning only); scale limitations (7B models struggle with advanced math); data dependency (data sources and filtering are not disclosed).

Future directions: expand to larger models (14B/32B); cover more math domains (advanced and competition math); explore more efficient algorithms (QLoRA, DoRA, etc.).