Zing Forum


Tiny Think: A Study on Reasoning-Prior Post-Training of 140M-Parameter Small Models with Single-Card Training

Tiny Think is a post-training study focused on the reasoning capabilities of ultra-small language models (140M parameters). The project explores the impact of supervised fine-tuning and preference optimization on mathematical and general reasoning abilities using a single consumer-grade GPU, revealing the capability trade-off phenomenon that may arise from post-training.

Tags: small language models · post-training · reasoning ability · DPO · supervised fine-tuning · mathematical reasoning · single-GPU training · open-source research
Published 2026-04-07 04:25 · Recent activity 2026-04-07 04:51 · Estimated read: 6 min

Section 01

Tiny Think Research Guide: Exploration of Reasoning-Prior Post-Training for 140M Small Models with Single-Card Training

Tiny Think is a post-training study on the reasoning capabilities of 140M-parameter ultra-small language models. It explores the impact of Supervised Fine-Tuning (SFT) and preference optimization (DPO/APO) on mathematical and general reasoning abilities using a single consumer-grade GPU, revealing the capability trade-off phenomenon in post-training (i.e., the "capability tax" where improvement in specific tasks is accompanied by degradation in general abilities). The research focuses on the practical value of edge deployment, and the code, models, and paper have been open-sourced.


Section 02

Research Background: Uncharted Territory and Practical Value of Small Model Reasoning

The scale race among large language models continues, but a more practical question is whether ultra-small models can achieve effective reasoning. Tiny Think focuses on 140M-parameter models and explores the effect of reasoning-prior post-training under strict hardware constraints. The 140M scale was chosen because it is small enough to run on a single consumer-grade GPU, yet large enough to encode reasoning patterns, and close to the upper limit of mobile/edge deployment, so its results have direct practical value.


Section 03

Core Questions and Two-Stage Post-Training Scheme

Core research questions: (1) Can SFT elicit mathematical reasoning capabilities at the 140M scale? (2) Can preference optimization improve mathematical accuracy? (3) Does that optimization degrade other abilities? Experimental environment: a single machine with one RTX 5060 Ti (16 GB), full-parameter fine-tuning, with the base model fixed to facebook/MobileLLM-R1-140M-base. Two-stage scheme: the first stage (SFT) uses approximately 60 million tokens of mathematical/STEM data (filtered and adapted from allenai/Dolci-Think-SFT-7B); the second stage (preference optimization) uses approximately 10 million tokens of preference-pair data, trying the DPO and APO-zero algorithms to calibrate reasoning-path selection.
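Since the project's training entry points are driven by YAML configs (per the implementation notes), the two-stage recipe could be sketched as a pair of trl-style config fragments. Only the model and dataset names come from the write-up; every file path and hyperparameter value below is an illustrative assumption, not taken from the actual repository:

```yaml
# --- Stage 1 (sketch): SFT on ~60M tokens of math/STEM data ---
model_name_or_path: facebook/MobileLLM-R1-140M-base
dataset_name: allenai/Dolci-Think-SFT-7B   # filtered/adapted subset
bf16: true
learning_rate: 2.0e-5        # placeholder value
num_train_epochs: 2          # placeholder value
output_dir: checkpoints/sft
---
# --- Stage 2 (sketch): preference optimization on ~10M tokens of pairs ---
model_name_or_path: checkpoints/sft        # start from the SFT checkpoint
loss_type: sigmoid           # standard DPO; trl also accepts "apo_zero"
beta: 0.1                    # placeholder KL-penalty strength
learning_rate: 5.0e-7        # placeholder value
output_dir: checkpoints/dpo
```

Note that trl's DPOTrainer exposes both variants through its `loss_type` option, which is presumably how a single config schema could cover the DPO and APO-zero runs.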


Section 04

Key Findings: The 'Capability Tax' of Mathematical Ability Improvement and General Ability Degradation

The experiment reveals a 'capability tax': post-training improves performance on the target task but degrades general abilities. Concretely: after SFT, GSM8K accuracy is 8.04%, BBH (general reasoning) 23.84%, and IFEval (instruction following) 21.63%; after DPO, GSM8K rises to 9.40%, but BBH drops to 13.18% and IFEval to 16.45%; after APO-zero, GSM8K is 8.26%, BBH 12.01%, and IFEval 16.08%. Preference optimization thus buys mathematical accuracy at the cost of general reasoning and instruction following.
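The trade-off is easiest to see as deltas relative to the SFT checkpoint. A minimal script over the reported scores (the `delta` helper is ours, not from the project):

```python
# Reported benchmark scores (%) from the Tiny Think write-up.
scores = {
    "SFT":      {"GSM8K": 8.04, "BBH": 23.84, "IFEval": 21.63},
    "DPO":      {"GSM8K": 9.40, "BBH": 13.18, "IFEval": 16.45},
    "APO-zero": {"GSM8K": 8.26, "BBH": 12.01, "IFEval": 16.08},
}

def delta(stage, task, base="SFT"):
    """Change in percentage points relative to the SFT checkpoint."""
    return round(scores[stage][task] - scores[base][task], 2)

# DPO gains +1.36 points on GSM8K but loses -10.66 on BBH and -5.18 on IFEval.
for task in ("GSM8K", "BBH", "IFEval"):
    print(f"DPO      {task:>6}: {delta('DPO', task):+.2f}")
    print(f"APO-zero {task:>6}: {delta('APO-zero', task):+.2f}")
```

The small math gain (+1.36 points at best) against double-digit general-reasoning losses is the 'capability tax' in numbers.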


Section 05

Evaluation System and Technical Implementation Details

The evaluation system covers multiple dimensions: mathematical benchmarks (GSM8K, MATH500), general reasoning (BBH), instruction following (IFEval), and STEM tasks (MMLU-STEM, ARC-Challenge, etc.). Evaluation tooling: vLLM inference acceleration plus the lm-eval framework, for efficiency and reproducibility. Technical implementation: Python 3.12 with the uv package manager, built on the trl library with Liger Kernel optimization; the code is split into configuration (YAML), data, training, and evaluation modules; the project positions itself as a controlled research codebase, not a general-purpose training framework.
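With that stack, a single evaluation run might look like the following lm-eval invocation using its vLLM backend; the checkpoint path, task selection, and memory setting are assumptions for illustration, not the project's actual command:

```
# Hypothetical: evaluate the SFT checkpoint on the benchmarks named above.
lm_eval --model vllm \
  --model_args pretrained=checkpoints/sft,dtype=bfloat16,gpu_memory_utilization=0.8 \
  --tasks gsm8k,bbh,ifeval,mmlu_stem,arc_challenge \
  --batch_size auto \
  --output_path results/sft
```

Running the same command once per checkpoint (SFT, DPO, APO-zero) would reproduce the per-stage comparison the study reports.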


Section 06

Research Significance and Contributions to Open-Source Ecosystem

Theoretical significance: reveals the capability trade-off pattern of post-training at ultra-small scale. Practical implications: small-model deployment must balance mathematical, general-reasoning, and instruction-following abilities, which requires a comprehensive evaluation suite. Hardware feasibility: demonstrates that a single consumer-grade GPU is sufficient for high-quality research. Open source: under the Apache-2.0 license, the code, models (SFT/DPO/APO checkpoints), and paper are publicly available, with a Hugging Face collection released to facilitate community research.