# Cultivating Reasoning Capabilities in Small Models: A Methodology for Training Arithmetic Reasoning from Scratch with Transformers

> A systematic empirical study showing that curriculum design matters more than applying RL early: targeted curriculum SFT followed by KL-regularized RL raises a small model's arithmetic reasoning accuracy from 80.7% to 90.7%.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T15:42:45.000Z
- Last activity: 2026-05-18T16:23:29.535Z
- Keywords: Transformer, curriculum learning, supervised fine-tuning, reinforcement learning, arithmetic reasoning, KL regularization, Pass@k, small models, SFT, RL
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-liuhprogramming-small-lm-reasoning-posttraining
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-liuhprogramming-small-lm-reasoning-posttraining

---

## Cultivating Reasoning Capabilities in Small Models: Core Findings and Methodology Overview

This article summarizes the findings of the open-source project small-LM-reasoning-posttraining: a small Transformer built from scratch can acquire arithmetic reasoning capabilities through carefully designed curriculum learning and post-training. The core finding is that **curriculum design is far more important than early RL application**. Basic capabilities must first be established with targeted curriculum SFT and only then refined with KL-regularized RL. The final strategy improves the small model's arithmetic reasoning accuracy from 80.7% to 90.7%, and the project provides a reproducible research framework that is also a useful reference for large model training.

## Research Background: Exploration of Reasoning Capabilities in Small Models

Large language models such as GPT-4 and Claude exhibit strong reasoning capabilities, but can small models acquire such capabilities without massive parameter counts or datasets? Inspired by Stanford CS336, the small-LM-reasoning-posttraining project implements the full pipeline: a causal Transformer, a byte-level tokenizer, synthetic reasoning data generation, SFT, sampling-based evaluation, reward modeling, and KL-regularized RL. The core question is: when does reasoning-oriented post-training truly improve a small model's capabilities, and when does it only teach answer formats or template matching?

## Core Methods: Curriculum Design and Training Strategies

### Curriculum Design: Progressive Learning Path
The arithmetic curriculum progresses from simple to complex: single-digit addition → two-digit addition without carry → two-digit addition with carry → mixed-digit addition → general addition. Mixed-digit problems turned out to be a hidden weakness (models perform worse on mixed tasks than on pure tasks), so explicit mixed-digit training buckets were added to address it.
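The staged curriculum above can be sketched as a small data generator. This is a minimal illustration, not the project's actual pipeline: the stage names, the `a+b=` prompt format, the operand ranges, and the helpers `has_carry`, `sample_pair`, and `build_curriculum` are all assumptions made for the example.

```python
import random

# Hypothetical stage labels, ordered from easiest to hardest.
STAGES = ["single_digit", "two_digit_no_carry", "two_digit_carry",
          "mixed_digit", "general"]


def has_carry(a: int, b: int) -> bool:
    """Return True if column-wise addition of a and b produces a carry."""
    while a > 0 and b > 0:
        if a % 10 + b % 10 >= 10:
            return True
        a //= 10
        b //= 10
    return False


def sample_pair(stage: str, rng: random.Random) -> tuple[int, int]:
    """Draw one operand pair for the given curriculum stage."""
    if stage == "single_digit":
        return rng.randint(0, 9), rng.randint(0, 9)
    if stage in ("two_digit_no_carry", "two_digit_carry"):
        want_carry = stage == "two_digit_carry"
        while True:  # rejection sampling until the carry condition matches
            a, b = rng.randint(10, 99), rng.randint(10, 99)
            if has_carry(a, b) == want_carry:
                return a, b
    if stage == "mixed_digit":
        # One two-digit and one single-digit operand, in random order.
        a, b = rng.randint(10, 99), rng.randint(0, 9)
        return (a, b) if rng.random() < 0.5 else (b, a)
    if stage == "general":
        return rng.randint(0, 99), rng.randint(0, 99)
    raise ValueError(f"unknown stage: {stage}")


def build_curriculum(n_per_stage: int, seed: int = 0) -> list[tuple[str, dict]]:
    """Return (stage, example) pairs in curriculum order."""
    rng = random.Random(seed)
    data = []
    for stage in STAGES:
        for _ in range(n_per_stage):
            a, b = sample_pair(stage, rng)
            data.append((stage, {"prompt": f"{a}+{b}=", "answer": str(a + b)}))
    return data
```

Keeping `mixed_digit` as its own bucket mirrors the explicit mixed-digit training buckets described above, so the number of such examples can be controlled directly rather than left to chance in the general stage.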

### Pass@k Evaluation
Pass@k measures whether the model can produce a correct answer when sampling: a prompt counts as solved if at least one of k sampled completions is correct. This determines whether RL training is feasible at all. The targeted SFT model achieves 99% Pass@8, which gives RL a sufficient reward signal.
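The definition above (at least one correct result in k samples) can be computed directly from sampled completions. The sketch below assumes exact string matching against a reference answer and a hypothetical function name `pass_at_k`. The project's own evaluation code may parse answers differently.

```python
def pass_at_k(samples_per_prompt: list[list[str]],
              answers: list[str],
              k: int) -> float:
    """Fraction of prompts with at least one correct answer among the first k samples."""
    if len(samples_per_prompt) != len(answers):
        raise ValueError("need one sample list per reference answer")
    hits = 0
    for samples, answer in zip(samples_per_prompt, answers):
        # A prompt counts as solved if any of its first k samples matches.
        if any(s.strip() == answer for s in samples[:k]):
            hits += 1
    return hits / len(answers)
```

For example, with samples `[["18", "15"], ["7", "8"], ["3", "3"]]` and answers `["15", "9", "3"]`, only the third prompt is solved by its first sample, while the first prompt is additionally solved once two samples are allowed.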

### KL-Regularized RL
The RL stage combines an answer-validator reward with a KL-divergence penalty that keeps the policy close to the SFT checkpoint and prevents it from drifting. A sweep over the beta coefficient shows that the result is stable: general accuracy fluctuates only between 91.4% and 91.6% across the tested values, and Pass@8 remains at 98%-100%.
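One common way to combine a validator reward with a KL penalty is to subtract a scaled KL estimate from the task reward. The sketch below assumes that form, a binary exact-match validator, and a per-sample KL estimate built from token log-probabilities of the sampled completion. The function names and the exact shaping are illustrative assumptions, not necessarily what the project implements.

```python
def validator_reward(completion: str, answer: str) -> float:
    """Binary reward: 1.0 if the completion matches the reference answer."""
    return 1.0 if completion.strip() == answer else 0.0


def kl_estimate(policy_logprobs: list[float], ref_logprobs: list[float]) -> float:
    """Single-sample estimate of KL(policy || reference) for one completion.

    Both lists hold log-probabilities of the same sampled tokens, one under
    the current policy and one under the frozen SFT checkpoint.
    """
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def shaped_reward(completion: str, answer: str,
                  policy_logprobs: list[float], ref_logprobs: list[float],
                  beta: float = 0.05) -> float:
    """Validator reward minus beta times the KL estimate (assumed shaping)."""
    kl = kl_estimate(policy_logprobs, ref_logprobs)
    return validator_reward(completion, answer) - beta * kl
```

With `beta=0.05`, a correct completion whose tokens are jointly 1.0 nats more likely under the policy than under the SFT checkpoint receives a reward of about 0.95. Larger beta values penalize the same drift more strongly, which is the trade-off the 0.02/0.05/0.10 sweep probes.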

## Experimental Evidence: Key Data and Results

- Targeted curriculum SFT improves mixed-digit problems: low-sum accuracy rises from 64.8% (control group) to 85.4%.
- The targeted SFT model achieves 99% Pass@8, while the old-curriculum control group reaches only 81%.
- The final strategy (targeted curriculum SFT + KL-regularized RL) improves general accuracy from 80.7% to 90.7% while maintaining a 100% answer parsing rate and high Pass@8.
- Tests with KL beta values of 0.02, 0.05, and 0.10 show stable results with only small fluctuations.

## Failure Mode Analysis: Model Limitations

Qualitative analysis of failure cases shows that targeted SFT fixes the format corruption seen with the old curriculum, but errors remain on difficult mixed-digit prompts. For example, on '12+3' the model sometimes produces a number replacement (18 instead of 15) or an operand duplication (123). These systematic weaknesses indicate a need for more targeted training data or architectural adjustments.
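Failure cases like these can be bucketed automatically before manual inspection. The sketch below is a hypothetical classifier, not the project's analysis tool: the category names and the rules (concatenated operands for duplication, exactly one differing digit for replacement) are one possible reading of the two error types described above.

```python
def classify_output(a: int, b: int, output: str) -> str:
    """Assign a model output for the prompt a+b to a coarse failure bucket."""
    text = output.strip()
    if not text.isdigit():
        return "unparseable"
    if int(text) == a + b:
        return "correct"
    # Operands copied into the answer, e.g. "123" for 12+3.
    if text in (f"{a}{b}", f"{b}{a}"):
        return "operand_duplication"
    expected = str(a + b)
    # Same length as the true sum with exactly one digit changed, e.g. "18" for 15.
    if len(text) == len(expected):
        if sum(x != y for x, y in zip(text, expected)) == 1:
            return "digit_replacement"
    return "other"
```

Counting outputs per bucket across an evaluation set gives a quick view of which error type dominates, which in turn suggests what kind of additional training data would help.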

## Methodological Contributions and Insights

### Methodological Contributions
The project provides a complete, reproducible research framework for small-model reasoning: a compact causal Transformer implementation, a byte-level tokenizer, a synthetic data generation pipeline, multi-seed controlled experiments, hyperparameter sweeps, and qualitative failure analysis tools.

### Insights for Large Model Training
1. SFT quality determines the upper limit of RL: if the SFT model's sampling distribution does not contain correct answers, RL receives no useful reward signal.
2. Progressive curriculum design may be superior to SFT on a single large-scale instruction dataset.
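The first insight can be illustrated with a toy calculation. The sketch assumes a policy-gradient setup with a binary reward and a mean-reward baseline over the samples for one prompt. The source does not specify the project's RL algorithm, so `mean_baseline_advantages` is purely illustrative.

```python
def mean_baseline_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample relative to the mean reward of its group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# No sample is correct: every advantage is zero, so the update carries no signal.
no_signal = mean_baseline_advantages([0.0, 0.0, 0.0, 0.0])

# One of four samples is correct: the correct sample is pushed up, the rest down.
some_signal = mean_baseline_advantages([0.0, 0.0, 0.0, 1.0])
```

Under these assumptions, a prompt that the SFT model never solves within its samples contributes nothing to learning, which is why a high Pass@8 before RL matters.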

## Conclusion: The Value of Small Model Research

The small-LM-reasoning-posttraining project provides empirical guidance for cultivating reasoning capabilities in small models through rigorous experimental design and in-depth analysis. Its core conclusion, that curriculum design matters more than applying RL blindly, challenges existing training practices. In an AI research landscape dominated by large models, small-model research has controllable costs and short iteration cycles, and it can reveal underlying regularities that the complexity of large models obscures.