Cultivating Reasoning Capabilities in Small Models: A Methodology for Training Arithmetic Reasoning from Scratch with Transformers

A systematic empirical study showing that curriculum design matters more than applying RL early: targeted curriculum SFT followed by KL-regularized RL raises the arithmetic reasoning accuracy of a small model from 80.7% to 90.7%.

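Transformer, curriculum learning, supervised fine-tuning, reinforcement learning, arithmetic reasoning, KL regularization, Pass@k, small models, SFT, RL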

Published 2026-05-18 23:42 | Recent activity 2026-05-19 00:23 | Estimated read 7 min

Section 01

Cultivating Reasoning Capabilities in Small Models: Core Findings and Methodology Overview

This article summarizes the findings of the open-source project small-LM-reasoning-posttraining: a small Transformer built from scratch can acquire arithmetic reasoning capabilities through carefully designed curriculum learning and post-training. The core finding is that curriculum design matters far more than applying RL early. Basic capabilities must first be established with targeted curriculum SFT and only then refined with KL-regularized RL. The final strategy raises the arithmetic reasoning accuracy of the small model from 80.7% to 90.7%, and the project provides a reproducible research framework whose lessons are also relevant to large model training.

Section 02

Research Background: Exploration of Reasoning Capabilities in Small Models

Large language models (such as GPT-4 and Claude) exhibit strong reasoning capabilities, but can small models acquire such capabilities without massive parameter counts or datasets? Inspired by Stanford CS336, the small-LM-reasoning-posttraining project implements the full pipeline: a causal Transformer, a byte-level tokenizer, synthetic reasoning data generation, SFT, sampling-based evaluation, reward modeling, and KL-regularized RL. The core question: when does reasoning-oriented post-training truly improve a small model's capabilities, and when does it only teach answer formats or template matching?

Section 03

Core Methods: Curriculum Design and Training Strategies

Curriculum Design: Progressive Learning Path

The arithmetic curriculum progresses from simple to complex: single-digit addition → double-digit addition without carry → double-digit addition with carry → mixed-digit addition → general addition. Mixed-digit problems turned out to be a hidden weakness (models perform worse on mixed tasks than on pure tasks), so explicit mixed-digit training buckets were added to address it.
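The staged curriculum above can be sketched as a small synthetic data generator. This is an illustrative sketch only, not the project's actual pipeline: the bucket names, the `a+b=c` text format, the 0 to 99 operand range for "general" addition, and the reading of "mixed-digit" as operands with different digit counts (as in '12+3') are all assumptions.

```python
import random

# Hypothetical bucket names mirroring the curriculum stages, in training order.
BUCKETS = ["single_digit", "double_no_carry", "double_carry", "mixed_digit", "general"]


def has_carry(a: int, b: int) -> bool:
    """Return True if column-wise addition of a and b produces a carry."""
    # The first carry can only come from a column whose two digits sum to >= 10.
    while a > 0 and b > 0:
        if a % 10 + b % 10 >= 10:
            return True
        a //= 10
        b //= 10
    return False


def sample_problem(bucket: str, rng: random.Random) -> tuple[int, int]:
    """Draw one pair of operands for the given curriculum bucket."""
    if bucket == "single_digit":
        return rng.randint(0, 9), rng.randint(0, 9)
    if bucket in ("double_no_carry", "double_carry"):
        want_carry = bucket == "double_carry"
        while True:  # rejection sampling until the carry condition matches
            a, b = rng.randint(10, 99), rng.randint(10, 99)
            if has_carry(a, b) == want_carry:
                return a, b
    if bucket == "mixed_digit":
        # Assumption: one two-digit and one single-digit operand, in random order.
        a, b = rng.randint(10, 99), rng.randint(0, 9)
        return (a, b) if rng.random() < 0.5 else (b, a)
    if bucket == "general":
        return rng.randint(0, 99), rng.randint(0, 99)
    raise ValueError(f"unknown bucket: {bucket}")


def build_curriculum(n_per_bucket: int, seed: int = 0) -> list[tuple[str, str]]:
    """Return (bucket, 'a+b=c') training examples, ordered easy to hard."""
    rng = random.Random(seed)
    examples = []
    for bucket in BUCKETS:
        for _ in range(n_per_bucket):
            a, b = sample_problem(bucket, rng)
            examples.append((bucket, f"{a}+{b}={a + b}"))
    return examples
```

Keeping mixed-digit problems in their own bucket makes it possible to control how often the model sees them, which is the kind of targeted exposure the explicit mixed-digit buckets provide.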

Pass@k Evaluation

Pass@k measures whether at least one of k sampled answers to a prompt is correct. It indicates whether RL training is feasible, because RL can only reinforce correct answers that the model already samples. The targeted SFT model achieves 99% Pass@8, which gives RL sufficient signal.
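As a minimal sketch of the metric as defined above, the function below computes the fraction of prompts for which at least one of the first k samples is correct. The function name and the input layout (one list of correctness flags per prompt) are assumptions for illustration; the project may compute Pass@k differently.

```python
def pass_at_k(samples_correct: list[list[bool]], k: int) -> float:
    """Fraction of prompts with at least one correct answer among the first k samples.

    samples_correct[i][j] is True if the j-th sampled answer to prompt i is correct.
    """
    if not samples_correct:
        raise ValueError("need at least one prompt")
    hits = 0
    for flags in samples_correct:
        if len(flags) < k:
            raise ValueError("every prompt needs at least k samples")
        hits += any(flags[:k])  # True counts as 1
    return hits / len(samples_correct)


# Four prompts with four samples each (made-up flags, for illustration only).
results = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, True, True],
    [False, False, False, True],
]
print(pass_at_k(results, k=1))  # 0.25
print(pass_at_k(results, k=4))  # 0.75
```

The gap between Pass@1 and Pass@k in this toy example is the headroom RL can exploit: correct answers exist in the sampling distribution but are not yet the most likely output.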

KL-Regularized RL

The RL stage combines an answer-validator reward with a KL divergence penalty that keeps the policy close to the SFT checkpoint and prevents it from drifting. A sweep over the KL coefficient beta shows that training is stable: general accuracy stays between 91.4% and 91.6% across the tested values, and Pass@8 remains at 98%-100%.
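The combination of a validator reward and a KL penalty can be sketched as follows. This is a sketch under stated assumptions, not the project's implementation: the function names, the rule of reading the answer after the last '=', and the single-sample KL estimate (sum of per-token log-probability differences between the policy and the frozen SFT reference) are all illustrative choices.

```python
def verifier_reward(completion: str, expected: int) -> float:
    """Return 1.0 if the text after the last '=' parses to the expected integer, else 0.0."""
    answer = completion.rsplit("=", 1)[-1].strip()
    try:
        return 1.0 if int(answer) == expected else 0.0
    except ValueError:  # unparseable answers earn no reward
        return 0.0


def kl_regularized_reward(
    reward: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float,
) -> float:
    """Compute reward - beta * KL estimate for one sampled completion.

    The KL term is estimated from the sampled tokens as
    sum_t [log pi(y_t) - log pi_ref(y_t)], where pi_ref is the SFT checkpoint.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# Made-up log-probabilities: the policy is more confident than the reference.
r = verifier_reward("12+3=15", expected=15)
shaped = kl_regularized_reward(r, [-0.1, -0.2], [-0.3, -0.4], beta=0.05)
print(round(shaped, 4))  # 0.98
```

A larger beta penalizes the same drift more heavily, so the policy stays closer to the SFT checkpoint at the cost of smaller updates.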

Section 04

Experimental Evidence: Key Data and Results

Targeted curriculum SFT improves mixed-digit problems: the low-sum accuracy increases from 64.8% (control group) to 85.4%;
The targeted SFT model achieves 99% Pass@8, while the old curriculum control group only reaches 81%;
The final strategy (targeted curriculum SFT + KL-regularized RL) improves general accuracy from 80.7% to 90.7%, maintaining a 100% answer parsing rate and high Pass@8 performance;
Tests of the KL beta parameter (0.02/0.05/0.10) show stability, with only small fluctuations in the results.

Section 05

Failure Mode Analysis: Model Limitations

Qualitative analysis of failure cases shows that targeted SFT fixes the format corruption seen with the old curriculum, but errors remain on difficult mixed-digit prompts. For example, given '12+3', the model sometimes produces a number replacement error (18) or an operand duplication error (123). These systematic weaknesses indicate the need for more targeted training data or architectural adjustments.
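A qualitative analysis like this can be supported by a small script that sorts model outputs into coarse failure categories. The sketch below is hypothetical: the category labels are invented for illustration, and it assumes that "operand duplication" means the operands' digits are copied into the answer (as in 123 for '12+3'), which the source does not spell out.

```python
def classify_failure(prompt: str, output: str) -> str:
    """Assign a coarse (hypothetical) label to a model output for an 'a+b' prompt."""
    a_text, b_text = prompt.split("+")
    expected = int(a_text) + int(b_text)
    text = output.strip()
    if not (text.isascii() and text.isdigit()):
        return "unparseable"  # format corruption: not a plain integer
    if int(text) == expected:
        return "correct"
    if text in (a_text + b_text, b_text + a_text):
        return "operand_copy"  # operand digits copied instead of summed
    return "wrong_number"  # well-formed but numerically wrong


for out in ["15", "18", "123", "1 5"]:
    print(out, "->", classify_failure("12+3", out))
```

Counting these labels per curriculum bucket would show whether the remaining errors cluster in mixed-digit prompts.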

Section 06

Methodological Contributions and Insights

Methodological Contributions

The project provides a complete and reproducible research framework for small model reasoning: a compact causal Transformer implementation, a byte-level tokenizer, a synthetic data generation pipeline, multi-seed controlled experiments, hyperparameter sweeps, and qualitative failure analysis tools.

Insights for Large Model Training

SFT quality determines the upper limit of RL: if the SFT model's sampling distribution does not contain correct answers, RL has no reward signal to reinforce;
Progressive curriculum design may be superior to SFT on a single large-scale instruction dataset.
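The first insight can be illustrated with a toy calculation. The sketch assumes a policy-gradient setup in which each sample's advantage is its reward minus the mean reward of the samples drawn for the same prompt; this baseline is an assumption for illustration, not a detail confirmed by the source. When none of the sampled answers is correct, every advantage is zero and the update carries no information.

```python
def advantages(rewards: list[float]) -> list[float]:
    """Mean-baseline advantages for the samples drawn for one prompt."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def has_learning_signal(rewards: list[float]) -> bool:
    """True if at least one sample would receive a non-zero policy-gradient weight."""
    return any(a != 0.0 for a in advantages(rewards))


# Eight samples per prompt, with 0/1 rewards from an answer validator.
print(has_learning_signal([0.0] * 8))                  # False: no correct sample
print(has_learning_signal([0.0] * 7 + [1.0]))          # True: one correct sample
```

This is why Pass@k of the SFT checkpoint matters before RL starts: a prompt contributes to learning only if correct answers are sampled at least occasionally.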

Section 07

Conclusion: The Value of Small Model Research

Through rigorous experimental design and in-depth analysis, the small-LM-reasoning-posttraining project provides empirical guidance for cultivating reasoning capabilities in small models. Its core conclusion, that curriculum design is superior to blind RL, challenges existing training practices. In an AI research landscape dominated by large models, small model research has controllable costs and short iteration cycles, and it can reveal underlying principles that the complexity of large models obscures.
