Zing Forum

Reasoning Trace Distillation: Can Small Models Learn to Think?

Exploring whether the small Qwen3 1.7B model can acquire complex reasoning abilities by distilling the reasoning traces of DeepSeek-R1, comparing five training methods to reveal feasible paths for the evolution of reasoning in small models.

Tags: Reasoning Distillation · GRPO · Small Models · DeepSeek-R1 · Qwen3 · Reinforcement Learning · LoRA
Published 2026-04-04 23:09 · Recent activity 2026-04-04 23:19 · Estimated read: 8 min

Section 01

Introduction

This project centers on the core question "Can small models learn to think?" and explores the feasibility of transferring complex reasoning abilities to the small Qwen3 1.7B model by distilling the reasoning traces of DeepSeek-R1. By comparing five training methods (baseline, SFT trace distillation, RL-verified trace re-distillation, pure GRPO reinforcement learning, and two-stage hybrid training), it attempts to reveal feasible paths for the evolution of reasoning in small models.

Section 02

Project Background and Core Issues

With the rise of large reasoning models such as DeepSeek-R1, the industry is focused on how to transfer their reasoning capabilities to small models. Small models offer low deployment cost, fast inference, and edge-friendliness, but they lack complex chain-of-thought reasoning abilities.

Traditional supervised fine-tuning (SFT) lets a model learn answers but rarely cultivates genuine reasoning; reinforcement learning (e.g., GRPO) can elicit reasoning potential, but training is expensive and unstable. This project aims to resolve this tension by exploring the feasibility of distilling reasoning traces from large models into small ones.

Section 03

Comparison of Five Experimental Conditions

The project compares five training strategies:

  1. Baseline condition: Only traditional supervised fine-tuning with the Orca Math dataset (no reasoning process) as a reference benchmark.
  2. SFT trace distillation: Supervised fine-tuning using the s1K-1.1 dataset containing DeepSeek-R1's complete reasoning traces to imitate the large model's thinking process.
  3. RL-verified trace re-distillation: Using the Open-R1 dataset (only containing correct reasoning traces verified by RL) to provide high-quality training signals.
  4. Pure GRPO reinforcement learning: Directly starting GRPO training from the base model to test the small model's ability to independently learn reasoning strategies.
  5. Two-stage hybrid training: First SFT trace distillation, then GRPO fine-tuning, combining the advantages of imitation and exploratory learning.
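For readability, the five conditions above can be summarized as a small lookup table. This is an illustrative sketch, not the project's actual configuration schema: the field names (`stages`, `dataset`, `reasoning_traces`) are assumptions, while the dataset names are those cited above.

```python
# Illustrative summary of the five experimental conditions.
# Field names are hypothetical; dataset names come from the article.
CONDITIONS = {
    "baseline":    {"stages": ["sft"],         "dataset": "Orca Math", "reasoning_traces": False},
    "sft_distill": {"stages": ["sft"],         "dataset": "s1K-1.1",   "reasoning_traces": True},
    "rl_verified": {"stages": ["sft"],         "dataset": "Open-R1",   "reasoning_traces": True},
    "pure_grpo":   {"stages": ["grpo"],        "dataset": None,        "reasoning_traces": False},
    "two_stage":   {"stages": ["sft", "grpo"], "dataset": "s1K-1.1",   "reasoning_traces": True},
}
```

A structure like this makes it easy to drive all five runs from one config-driven entry point rather than five near-duplicate scripts.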
Section 04

Key Details of Technical Implementation

  • Model configuration: Using Qwen3 1.7B, LoRA (rank 64) + rsLoRA to avoid gradient collapse, training with bfloat16 precision.
  • Dual image strategy: the SFT container image is built on PyTorch 2.8 + flash-attention, while the GRPO image uses trl[vllm], resolving memory-pool and compilation conflicts.
  • Reward function: Binary reward (answer correctness: 0/1) + format reward (encouraging specific output formats), handling TRL message dictionary format.
  • Tokenizer alignment: Setting eos_token to "" to resolve the Qwen3 end token misalignment issue.
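The reward design above can be sketched as a TRL-style reward function. This is a minimal illustration, assuming a numeric-answer task and TRL's conversational completion format (each completion is a list of message dicts); the answer-extraction regex and the 0.5 format bonus are assumptions, not the project's actual values.

```python
import re

def combined_reward(completions, answer, **kwargs):
    """Sketch of a GRPO reward: binary correctness (0/1) plus a small format bonus.

    Assumes TRL's conversational format, where each completion is a list of
    message dicts and the generated text is in the last message's "content".
    `answer` is assumed to be the batch's list of gold answers.
    """
    rewards = []
    for completion, gold in zip(completions, answer):
        text = completion[-1]["content"] if isinstance(completion, list) else completion
        # Binary correctness: compare the last number in the output to the gold answer.
        nums = re.findall(r"-?\d+(?:\.\d+)?", text)
        correct = 1.0 if nums and nums[-1] == str(gold) else 0.0
        # Format bonus (assumed value): encourage an explicit <think>...</think> block.
        fmt = 0.5 if re.search(r"<think>.*?</think>", text, re.DOTALL) else 0.0
        rewards.append(correct + fmt)
    return rewards
```

Handling both the plain-string and message-dict cases is what the "handling TRL message dictionary format" bullet refers to: TRL passes completions in different shapes depending on whether the dataset is conversational.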
Section 05

Evaluation Methods and Benchmark Tests

Using GSM8K and MATH mathematical reasoning benchmarks, supporting pass@k metrics:

  • Recoverable evaluation: spot_check_gsm8k supports the start_from parameter for resuming after interruption.
  • Quick test: quick_test provides 5-sample rapid verification for easy iteration.
  • Distributed support: elastic scheduling of L40S GPUs (for SFT) and H100 GPUs (for GRPO) via Modal.
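The pass@k metric mentioned above is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021): generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k draws is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), the probability that at
    least one of k samples drawn without replacement from n generations,
    c of which are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Computing this estimator instead of naively running k-sample trials avoids high variance and wasted generations when the same n samples are reused for several values of k.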
Section 06

Engineering Practice Value

  • Configuration-driven architecture: a single config.yaml manages all hyperparameters, with nested configuration inheritance and overrides to avoid hardcoding and drift.
  • Modular design: Separating data loading, reward calculation, training, and evaluation modules with single responsibilities for easy reuse.
  • Adaptive attention: Automatically detecting flash-attention availability, prioritizing its use and falling back to SDPA if not available to ensure hardware compatibility.
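The adaptive-attention bullet can be implemented with a simple availability probe. A sketch, assuming the returned string is passed as the `attn_implementation` argument of Hugging Face transformers' `from_pretrained` (the function name here is illustrative):

```python
from importlib import util

def pick_attn_implementation() -> str:
    """Prefer flash-attention when the flash_attn package is installed;
    otherwise fall back to PyTorch's scaled-dot-product attention (SDPA).

    The returned strings match the `attn_implementation` values accepted
    by Hugging Face transformers.
    """
    return "flash_attention_2" if util.find_spec("flash_attn") is not None else "sdpa"
```

Probing with `importlib.util.find_spec` avoids importing (and compiling CUDA extensions for) flash_attn just to check whether it exists, which keeps startup fast on machines without it.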
Section 07

Research Significance and Outlook

This project is a systematic exploration of "Can small models learn to think?" Through comparative experiments, it can quantitatively analyze:

  • The effect of simply imitating large models' reasoning traces.
  • The improvement of distillation quality by RL verification.
  • Whether pure RL can enable models to independently develop reasoning abilities.
  • The synergistic effect of two-stage training.

These results will provide important references for small model reasoning optimization, and the rigorous experimental design and open-source spirit are worthy of recognition.

Section 08

Conclusion

In today's era of expanding AI capabilities, enabling small models to obtain reasoning abilities close to those of large models has both academic and practical significance. Through carefully designed comparative experiments, this project contributes empirical data and methodological references, which are worthy of in-depth study and reference by developers concerned with the balance between model efficiency and capability.