Zing Forum


Tiny Think: A Study on Reasoning-Prior Post-Training of 140M-Parameter Small Models with Single-Card Training

Tiny Think is a post-training study focused on the reasoning capabilities of ultra-small language models (140M parameters). The project explores the impact of supervised fine-tuning and preference optimization on mathematical and general reasoning abilities using a single consumer-grade GPU, revealing the capability trade-off phenomenon that may arise from post-training.

Tags: small language models · post-training · reasoning ability · DPO · supervised fine-tuning · mathematical reasoning · single-GPU training · open-source research
Published 2026-04-07 04:25 · Recent activity 2026-04-07 04:51 · Estimated read: 6 min

Section 01

Tiny Think Research Guide: Exploration of Reasoning-Prior Post-Training for 140M Small Models with Single-Card Training

Tiny Think is a post-training study on the reasoning capabilities of 140M-parameter ultra-small language models. It explores the impact of Supervised Fine-Tuning (SFT) and preference optimization (DPO/APO) on mathematical and general reasoning abilities using a single consumer-grade GPU, revealing the capability trade-off phenomenon in post-training (i.e., the "capability tax" where improvement in specific tasks is accompanied by degradation in general abilities). The research focuses on the practical value of edge deployment, and the code, models, and paper have been open-sourced.


Section 02

Research Background: Uncharted Territory and Practical Value of Small Model Reasoning

The scale race among large language models continues, but a more practical question is whether ultra-small models can achieve effective reasoning. Tiny Think focuses on 140M-parameter models and explores the effect of reasoning-prior post-training under strict hardware constraints. The 140M scale was chosen because it is small enough to run on a single consumer-grade GPU, yet large enough to encode reasoning patterns, and close to the upper limit of mobile/edge deployment, so its results have direct practical value.


Section 03

Core Questions and Two-Stage Post-Training Scheme

Core research questions: (1) Can SFT elicit mathematical reasoning capabilities at the 140M scale? (2) Can preference optimization improve mathematical accuracy? (3) Does that optimization degrade other abilities? Experimental environment: a single machine with one RTX 5060 Ti (16 GB), full-parameter fine-tuning, with the base model fixed to facebook/MobileLLM-R1-140M-base. Two-stage scheme: the first stage (SFT) uses approximately 60 million tokens of mathematical/STEM data (filtered and adapted from allenai/Dolci-Think-SFT-7B); the second stage (preference optimization) uses approximately 10 million tokens of preference-pair data, trying the DPO and APO-zero algorithms to calibrate reasoning-path selection.
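Since the project's training entry points are driven by YAML configs (per the implementation notes), the two-stage recipe could be sketched as a pair of trl-style config fragments. Only the model and dataset names come from the write-up; every file path and hyperparameter value below is an illustrative assumption, not taken from the actual repository:

```yaml
# --- Stage 1 (sketch): SFT on ~60M tokens of math/STEM data ---
model_name_or_path: facebook/MobileLLM-R1-140M-base
dataset_name: allenai/Dolci-Think-SFT-7B   # filtered/adapted subset
bf16: true
learning_rate: 2.0e-5        # placeholder value
num_train_epochs: 2          # placeholder value
output_dir: checkpoints/sft
---
# --- Stage 2 (sketch): preference optimization on ~10M tokens of pairs ---
model_name_or_path: checkpoints/sft        # start from the SFT checkpoint
loss_type: sigmoid           # standard DPO; trl also accepts "apo_zero"
beta: 0.1                    # placeholder KL-penalty strength
learning_rate: 5.0e-7        # placeholder value
output_dir: checkpoints/dpo
```

Note that trl's DPOTrainer exposes both variants through its `loss_type` option, which is presumably how a single config schema could cover the DPO and APO-zero runs.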


Section 04

Key Findings: The 'Capability Tax' of Mathematical Ability Improvement and General Ability Degradation

The experiment reveals a 'capability tax': post-training improves performance on the target task but degrades general abilities. Concretely: after SFT, GSM8K accuracy is 8.04%, BBH (general reasoning) 23.84%, and IFEval (instruction following) 21.63%; after DPO, GSM8K rises to 9.40%, but BBH drops to 13.18% and IFEval to 16.45%; after APO-zero, GSM8K is 8.26%, BBH 12.01%, and IFEval 16.08%. Preference optimization thus buys mathematical accuracy at the cost of general reasoning and instruction following.
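The trade-off is easiest to see as deltas relative to the SFT checkpoint. A minimal script over the reported scores (the `delta` helper is ours, not from the project):

```python
# Reported benchmark scores (%) from the Tiny Think write-up.
scores = {
    "SFT":      {"GSM8K": 8.04, "BBH": 23.84, "IFEval": 21.63},
    "DPO":      {"GSM8K": 9.40, "BBH": 13.18, "IFEval": 16.45},
    "APO-zero": {"GSM8K": 8.26, "BBH": 12.01, "IFEval": 16.08},
}

def delta(stage, task, base="SFT"):
    """Change in percentage points relative to the SFT checkpoint."""
    return round(scores[stage][task] - scores[base][task], 2)

# DPO gains +1.36 points on GSM8K but loses -10.66 on BBH and -5.18 on IFEval.
for task in ("GSM8K", "BBH", "IFEval"):
    print(f"DPO      {task:>6}: {delta('DPO', task):+.2f}")
    print(f"APO-zero {task:>6}: {delta('APO-zero', task):+.2f}")
```

The small math gain (+1.36 points at best) against double-digit general-reasoning losses is the 'capability tax' in numbers.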


Section 05

Evaluation System and Technical Implementation Details

The evaluation system covers multiple dimensions: mathematical benchmarks (GSM8K, MATH500), general reasoning (BBH), instruction following (IFEval), and STEM tasks (MMLU-STEM, ARC-Challenge, etc.). Evaluation tooling: vLLM inference acceleration plus the lm-eval framework, for efficiency and reproducibility. Technical implementation: Python 3.12 with the uv package manager, built on the trl library with Liger Kernel optimization; the code is split into configuration (YAML), data, training, and evaluation modules; the project positions itself as a controlled research codebase, not a general-purpose training framework.
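With that stack, a single evaluation run might look like the following lm-eval invocation using its vLLM backend; the checkpoint path, task selection, and memory setting are assumptions for illustration, not the project's actual command:

```
# Hypothetical: evaluate the SFT checkpoint on the benchmarks named above.
lm_eval --model vllm \
  --model_args pretrained=checkpoints/sft,dtype=bfloat16,gpu_memory_utilization=0.8 \
  --tasks gsm8k,bbh,ifeval,mmlu_stem,arc_challenge \
  --batch_size auto \
  --output_path results/sft
```

Running the same command once per checkpoint (SFT, DPO, APO-zero) would reproduce the per-stage comparison the study reports.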


Section 06

Research Significance and Contributions to Open-Source Ecosystem

Theoretical significance: reveals the capability trade-off pattern of post-training at ultra-small scale. Practical implications: small-model deployment must balance mathematical, general-reasoning, and instruction-following abilities, which requires a comprehensive evaluation suite. Hardware feasibility: demonstrates that a single consumer-grade GPU is sufficient for high-quality research. Open source: under the Apache-2.0 license, the code, models (SFT/DPO/APO checkpoints), and paper are publicly available, with a Hugging Face collection released to facilitate community research.