# Reasoning Trace Distillation: Can Small Models Learn to Think?

> This project explores whether the Qwen3 1.7B small model can acquire complex reasoning abilities by distilling the reasoning traces of DeepSeek-R1. It compares five training methods to identify workable ways of teaching small models to reason.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-04T15:09:51.000Z
- Last activity: 2026-04-04T15:19:55.727Z
- Popularity: 157.8
- Keywords: reasoning distillation, GRPO, small models, DeepSeek-R1, Qwen3, reinforcement learning, LoRA
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-maxiruess-reasoning-distillation-grpo
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-maxiruess-reasoning-distillation-grpo

---

## Introduction

This project asks a single question: can small models learn to think? It tests whether complex reasoning can be transferred to the Qwen3 1.7B model by distilling the reasoning traces of DeepSeek-R1, and compares five training setups: a baseline, SFT trace distillation, RL-verified trace re-distillation, pure GRPO reinforcement learning, and two-stage hybrid training.

## Project Background and Core Issues

With the rise of large reasoning models such as DeepSeek-R1, attention has turned to transferring their reasoning capabilities to small models. Small models are cheap to deploy, fast at inference, and well suited to edge devices, but they lack strong chain-of-thought reasoning.

Traditional supervised fine-tuning (SFT) teaches a model to produce answers but does little to cultivate reasoning. Reinforcement learning (e.g., GRPO) can elicit reasoning, but training is costly and can be unstable. This project addresses that trade-off by testing whether reasoning traces from a large model can be distilled into a small one.

## Comparison of Five Experimental Conditions

The project designs five training strategy combinations:
1. **Baseline condition**: Standard supervised fine-tuning on the Orca Math dataset, which contains no reasoning traces, as the reference point.
2. **SFT trace distillation**: Supervised fine-tuning on the s1K-1.1 dataset, which contains DeepSeek-R1's complete reasoning traces, so the small model imitates the large model's thinking process.
3. **RL-verified trace re-distillation**: Training on the Open-R1 dataset, which contains only correct, RL-verified reasoning traces, for a higher-quality training signal.
4. **Pure GRPO reinforcement learning**: GRPO training started directly from the base model, to test whether the small model can learn reasoning strategies on its own.
5. **Two-stage hybrid training**: SFT trace distillation first, then GRPO fine-tuning, combining imitation with exploratory learning.
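Conditions 4 and 5 depend on GRPO, which scores each sampled completion relative to the other completions sampled for the same prompt instead of using a learned value model. The sketch below shows only that group-relative advantage step with binary correctness rewards. The function name, the `eps` value, and the use of the population standard deviation are illustrative assumptions, not details taken from the project's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions, only the first one is correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every sample is wrong: all advantages are zero, so the prompt
# contributes no learning signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When a group is entirely wrong (or entirely right), the advantages vanish. That is one reason pure GRPO from a weak base model can stall, and one reason an SFT warm start, as in condition 5, may help.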

## Key Details of Technical Implementation

- **Model configuration**: Qwen3 1.7B with LoRA (rank 64) plus rsLoRA (rank-stabilized scaling) to avoid gradient collapse, trained in bfloat16 precision.
- **Dual image strategy**: Two separate container images. The SFT image is built on PyTorch 2.8 + flash-attention, and the GRPO image uses `trl[vllm]`, which resolves memory pool and compilation conflicts between the two stacks.
- **Reward function**: A binary reward (answer correctness: 0/1) plus a format reward that encourages a specific output format. The reward code also handles TRL's message-dictionary format.
- **Tokenizer alignment**: `eos_token` is set to `</think>` to resolve the Qwen3 end-token misalignment issue.
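The reward design described above can be sketched roughly as follows. TRL's GRPO trainer calls each reward function with a `completions` list, passes extra dataset columns as keyword arguments, and expects one float per completion. Completions arrive as plain strings or as lists of chat-message dictionaries, depending on the dataset format. Everything else here is an assumption for illustration: the `answer` column name, the `\boxed{...}` answer convention, the `<think>...</think>` format check, and all function names.

```python
import re

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
THINK_FORMAT = re.compile(r"^\s*<think>.+?</think>\s*\S", re.DOTALL)


def completion_text(completion):
    """Return plain text from either a string or a list of chat-message dicts."""
    if isinstance(completion, str):
        return completion
    return completion[-1]["content"]


def extract_answer(text):
    """Prefer the last boxed expression; otherwise fall back to the last number."""
    boxed = BOXED.findall(text)
    if boxed:
        return boxed[-1].strip()
    numbers = NUMBER.findall(text.replace(",", ""))
    return numbers[-1] if numbers else None


def correctness_reward(completions, answer, **kwargs):
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, answer):
        predicted = extract_answer(completion_text(completion))
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards


def format_reward(completions, **kwargs):
    """1.0 if the output is a think block followed by a visible answer, else 0.0."""
    return [1.0 if THINK_FORMAT.match(completion_text(c)) else 0.0
            for c in completions]
```

String comparison is deliberately strict in this sketch. A real pipeline would likely normalize fractions, units, and equivalent numeric forms before comparing.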

## Evaluation Methods and Benchmark Tests

The project evaluates on the GSM8K and MATH mathematical reasoning benchmarks and supports pass@k metrics:
- **Resumable evaluation**: `spot_check_gsm8k` accepts a `start_from` parameter so that an interrupted run can be resumed.
- **Quick test**: `quick_test` runs a 5-sample check for fast iteration.
- **Distributed support**: Elastic scheduling of L40S GPUs (for SFT) and H100 GPUs (for GRPO) via Modal.
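For readers unfamiliar with pass@k, the widely used unbiased estimator can be written in a few lines. Given `n` sampled completions for a problem, of which `c` are correct, it estimates the probability that at least one of `k` randomly drawn samples is correct. Whether the project uses this exact estimator is an assumption. The function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case any draw of k samples must contain a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

For example, with 10 samples per problem and 3 correct, `pass_at_k(10, 3, 1)` is about 0.3, which matches the plain accuracy of a single sample.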

## Engineering Practice Value

- **Configuration-driven architecture**: A single `config.yaml` manages hyperparameters. Nested configurations inherit from and override shared defaults, which avoids hardcoded values and configuration drift.
- **Modular design**: Data loading, reward calculation, training, and evaluation live in separate single-responsibility modules that are easy to reuse.
- **Adaptive attention**: The code detects whether flash-attention is available, uses it when present, and falls back to SDPA otherwise to stay compatible across hardware.
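The configuration inheritance and attention fallback described above might look something like the sketch below. The dictionary keys, the placeholder values, and both function names are hypothetical, and the project's actual `config.yaml` layout is not shown in this post. The returned strings `"flash_attention_2"` and `"sdpa"` are the values that Hugging Face Transformers accepts for `attn_implementation`.

```python
import copy
import importlib.util


def deep_merge(base, override):
    """Return a new dict where nested keys in `override` replace those in `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def pick_attention_backend():
    """Use flash-attention when the package is installed, else fall back to SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Hypothetical shared defaults, as they might appear after loading config.yaml.
BASE = {
    "model": {"name": "Qwen3-1.7B", "dtype": "bfloat16"},
    "lora": {"rank": 64, "use_rslora": True},
    "train": {"epochs": 1},  # placeholder value, not from the project
}

# A per-condition override only states what differs from the defaults.
grpo_config = deep_merge(BASE, {"train": {"method": "grpo"}})
grpo_config["model"]["attn_implementation"] = pick_attention_backend()
```

Merging into a copy keeps the shared defaults untouched, so each of the five conditions can be derived from the same base without one run's overrides leaking into another.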

## Research Significance and Outlook

This project is a systematic exploration of the question "Can small models learn to think?" Its comparative experiments make it possible to quantify:
- The effect of simply imitating a large model's reasoning traces.
- How much RL verification improves distillation quality.
- Whether pure RL lets a model develop reasoning abilities on its own.
- The synergy gained from two-stage training.

These comparisons can serve as a reference for optimizing reasoning in small models, and the careful experimental design and open-source approach are commendable.

## Conclusion

As AI capabilities keep expanding, getting small models to reason nearly as well as large ones matters both academically and practically. Through its comparative experiments, this project contributes empirical data and a methodological reference for developers who care about balancing model efficiency and capability.
