# Math-SLM: Efficient Training of a Small Math Reasoning Model in 3.5 Hours

> The math-slm project demonstrates how to fine-tune DeepSeek-R1-Distill-Qwen-7B for math reasoning in just 3.5 hours on 8 H100 GPUs. It combines SFT, DPO, and LoRA, offering an efficient approach to model training in resource-constrained scenarios.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-07T09:07:04.000Z
- Last activity: 2026-05-24T09:19:24.448Z
- Popularity: 77.0
- Keywords: math reasoning, model fine-tuning, LoRA, DPO, SFT, DeepSeek, efficient training
- Page link: https://www.zingnex.cn/en/forum/thread/math-slm-3-5
- Canonical: https://www.zingnex.cn/forum/thread/math-slm-3-5
- Markdown source: floors_fallback

---

## Introduction

This project was published by debtirthasaha on GitHub (https://github.com/debtirthasaha/math-slm). It shows how to fine-tune DeepSeek-R1-Distill-Qwen-7B for math reasoning in just 3.5 hours on 8 H100 GPUs. The core strategy combines SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and LoRA (Low-Rank Adaptation), offering an efficient approach to model training in resource-constrained scenarios.

## Project Background and Motivation

Math reasoning remains one of the core challenges for large language models. Closed-source models such as GPT-4 and Claude perform well, but the open-source community still needs efficient ways to train small models with limited resources. DeepSeek-R1-Distill-Qwen-7B already has strong reasoning capabilities, yet traditional full-parameter fine-tuning is costly and demands substantial hardware. This project aims to cut training cost significantly while maintaining performance by combining several techniques.

## Analysis of Core Technical Solutions

1. **LoRA**: Applied to the projection matrices of the attention layers. The original model weights stay frozen and only a small number of low-rank matrices are trained, which significantly reduces the number of trainable parameters and the computational overhead.
2. **SFT**: Uses high-quality math instruction datasets so the model learns standard problem-solving steps and logical chains.
3. **DPO**: Learns directly from preference data (pairs of preferred and rejected answers) without training a separate reward model, which simplifies the pipeline and improves output accuracy and readability.
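To make the LoRA and DPO ideas above concrete, here is a minimal sketch in plain Python. It is not the project's training code. The layer size (4096), rank (16), and `beta` (0.1) are illustrative assumptions, not values taken from the repository, and the function names are hypothetical. The first function counts the parameters a single LoRA adapter adds to one weight matrix; the second computes the standard DPO loss for one preference pair from sequence log-probabilities.

```python
import math


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) weight.

    LoRA learns a low-rank update B @ A, where A is (rank x d_in)
    and B is (d_out x rank); the original weight stays frozen.
    """
    return rank * d_in + d_out * rank


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full answer under the
    trainable policy and under the frozen reference model.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a
    # numerically stable form to avoid overflow for large |margin|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustrative numbers only (assumed, not from the project).
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=16)
print(f"full matrix: {full} params, LoRA adapter: {lora} params")

# The loss drops once the policy prefers the chosen answer more than
# the reference model does.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))  # no preference yet
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))   # policy favors chosen
```

With these assumed sizes, the adapter trains 128 times fewer parameters than the full matrix, which is where the savings in the combined strategy come from.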

## Training Efficiency Optimization Strategies

1. **Distributed Parallelism**: Combines data parallelism (each GPU processes different batches) with model parallelism (to work around single-card memory limits) so that all 8 H100 GPUs are fully used.
2. **Mixed Precision Training**: Uses FP16/BF16 to reduce memory usage and computation time, with gradient accumulation to reach a larger effective batch size without increasing per-step memory.
3. **Efficient Data Processing**: Optimizes tokenization and batching to minimize I/O waits and keep the GPUs busy with computation.
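The arithmetic behind these strategies can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's configuration: the per-GPU batch size, accumulation steps, and the round figure of 7 billion parameters are hypothetical values chosen for the example, and the toy one-parameter regression stands in for a real model. The sketch shows how the effective batch size is composed, how much memory the weights alone take at different precisions, and that averaging gradients over equally sized micro-batches reproduces the full-batch gradient.

```python
def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int,
                         data_parallel_gpus: int) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu_batch * grad_accum_steps * data_parallel_gpus


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone (no gradients or optimizer state)."""
    return num_params * bytes_per_param / 2**30


def mse_grad(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list, ys: list,
                     micro_batch_size: int) -> float:
    """Average the micro-batch gradients, as gradient accumulation does.

    Matches the full-batch gradient when every micro-batch has the
    same size.
    """
    starts = range(0, len(xs), micro_batch_size)
    steps = len(starts)
    total = 0.0
    for s in starts:
        e = s + micro_batch_size
        total += mse_grad(w, xs[s:e], ys[s:e]) / steps
    return total


# Assumed settings for illustration only.
print(effective_batch_size(per_gpu_batch=4, grad_accum_steps=8,
                           data_parallel_gpus=8))
print(round(weight_memory_gib(7_000_000_000, 2), 2), "GiB in BF16/FP16")
print(round(weight_memory_gib(7_000_000_000, 4), 2), "GiB in FP32")

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print(mse_grad(0.5, xs, ys), accumulated_grad(0.5, xs, ys, 2))
```

Halving the bytes per parameter halves the weight memory, and gradient accumulation trades extra forward/backward passes for a larger effective batch at the same per-step memory.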

## Model Performance and Evaluation

The trained model has been released on Hugging Face (MR0b0t/math-slm-sft-dpo-v5). No detailed benchmark scores are available, so the following describes expected rather than measured behavior:

- Basic arithmetic and algebra: accurate multi-step calculations.
- Geometry and probability: converting natural-language problems into expressions and applying the relevant theorems.
- Complex reasoning: generating interpretable, step-by-step solutions.
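Since no benchmark scores are published, one simple way to check such a model yourself is exact-match accuracy on the final numeric answer. The sketch below is a hypothetical scoring helper, not part of the project: the function names, the "last number in the output" heuristic, and the sample outputs are all assumptions made for illustration.

```python
import re


def extract_final_answer(text: str):
    """Return the last number in the text as a string, or None.

    Heuristic: thousands separators are dropped and the last numeric
    token is taken as the model's final answer.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def exact_match_accuracy(outputs: list, references: list) -> float:
    """Fraction of outputs whose final number equals the reference."""
    correct = 0
    for out, ref in zip(outputs, references):
        answer = extract_final_answer(out)
        if answer is not None and float(answer) == float(ref):
            correct += 1
    return correct / len(references)


# Made-up model outputs, for illustration only.
outputs = [
    "3 * 4 = 12, so the answer is 12",
    "The total is 7.5",
    "I am not sure",
]
references = ["12", "7.5", "3"]
print(exact_match_accuracy(outputs, references))
```

A heuristic like this only checks the final answer, so it says nothing about whether the intermediate reasoning steps are correct.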

## Practical Value and Application Scenarios

- Researchers/Developers: A reproducible and efficient training template that can be extended to larger models or other reasoning domains.
- Resource-Constrained Teams: Proves that competitive specialized models can be trained with low resources.
- EdTech: Suitable for intelligent tutoring, automatic grading, and personalized recommendations, with better accuracy and consistency.

## Limitations and Future Directions

**Limitations**:

- Domain specificity: the model targets math reasoning only.
- Scale limitations: 7B models struggle with advanced math.
- Data dependency: data sources and filtering are not disclosed.

**Future**:

- Expand to larger models (14B/32B).
- Cover more math domains (advanced and competition math).
- Explore more efficient algorithms (QLoRA, DoRA, etc.).
