Qwen Small Model Reasoning Ability Distillation Practice: Exploration of Combining SFT and On-Policy Distillation

Exploring how to transfer the reasoning capabilities of large models to small Qwen models through the combination of Supervised Fine-Tuning (SFT) and on-policy distillation, enabling efficient inference on edge devices.

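Tags: Qwen, model distillation, supervised fine-tuning, on-policy distillation, reasoning models, edge computing, small model optimization, SFT, distillation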

Published 2026-06-13 22:36 | Recent activity 2026-06-13 22:57 | Estimated read 8 min

Section 01

Qwen Small Model Reasoning Ability Distillation Practice: Exploration of Combining SFT and On-Policy Distillation (Introduction)

This project explores how to transfer the reasoning capabilities of large models to small Qwen models by combining Supervised Fine-Tuning (SFT) with on-policy distillation, with the goal of efficient inference on edge devices. Its core idea is a "learning by doing" mode of distillation: the student model generates its own reasoning processes and is optimized on real-time feedback from the teacher model, rather than passively imitating fixed teacher outputs as traditional methods do. (Original author: kakopappa, Source: GitHub, Release date: 2026-06-13)

Section 02

Background: Dilemmas in Transferring Large Model Reasoning Capabilities and Limitations of Existing Methods

As Large Language Models (LLMs) have shown strong performance on complex reasoning tasks, transferring those capabilities to resource-constrained small models has become a focus for the industry. Traditional Supervised Fine-Tuning (SFT) lets a small model imitate a large model's outputs, but the small model struggles to acquire the underlying reasoning chain. Static distillation has the student passively learn the teacher's "standard answers", which are fixed in advance and cannot adapt to what the student actually produces, limiting reasoning flexibility.

Section 03

Project Overview: Innovative Attempt of On-Policy Distillation

This project focuses on building reasoning capabilities in small models of the Qwen series by combining SFT with on-policy distillation. Unlike offline methods, on-policy distillation has the student model generate its own answers during training, which the teacher model then evaluates in real time. This resembles the human process of "learning by doing" and suits reasoning tasks, where a single problem often admits multiple solution paths.

Section 04

Analysis of Core Technical Mechanisms

Supervised Fine-Tuning (SFT) Phase

First, SFT is performed using high-quality reasoning datasets (including chain-of-thought annotations) to lay the foundation for the model's reasoning ability and help it understand reasonable reasoning steps and logical expressions.
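
The data preparation for this phase can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the question/reasoning/answer record layout, the prompt template, and the toy whitespace tokenizer are all assumptions made for the example. The one fixed convention it relies on is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked prompt positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_sft_example(question, reasoning, answer, vocab):
    """Return (input_ids, labels) for one chain-of-thought record.

    `vocab` is a hypothetical word -> id dict that grows on demand; a real
    pipeline would use the model's own tokenizer instead.
    """
    def encode(text):
        return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

    # Hypothetical prompt template, chosen only for this illustration.
    prompt_ids = encode(f"Question: {question} Answer:")
    target_ids = encode(f"{reasoning} Final answer: {answer}")

    input_ids = prompt_ids + target_ids
    # Mask the prompt so the loss covers only the reasoning and answer tokens.
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels


vocab = {}
input_ids, labels = build_sft_example(
    "What is 3 + 4?", "3 plus 4 equals 7.", "7", vocab
)
```

Supervising the reasoning tokens as well as the final answer is what lets the model learn the intermediate steps instead of only the result.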

On-Policy Distillation Phase

Sampling and Generation: The student model generates multiple candidate answers for the problem;
Policy Evaluation: The teacher model evaluates the quality of candidate answers and provides reward signals;
Policy Optimization: The student model updates its parameters based on the rewards, shifting probability toward higher-reward answers.
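
The three steps above can be sketched as a toy loop. This is an illustration under stated assumptions, not the project's training code: the "student" is reduced to a softmax over four fixed candidate answers, `teacher_reward` is a hypothetical stand-in for a teacher model's score, and the update rule is a plain REINFORCE policy gradient with a mean-reward baseline, chosen here because the source does not name its optimizer.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_reward(candidate):
    """Hypothetical teacher: candidate 2 is the best answer."""
    return {0: 0.1, 1: 0.3, 2: 1.0, 3: 0.0}[candidate]


def on_policy_step(logits, rng, num_samples=16, lr=0.5):
    probs = softmax(logits)
    # 1) Sampling and generation: the student samples its own candidates.
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    # 2) Policy evaluation: the teacher scores each sampled candidate.
    rewards = [teacher_reward(s) for s in samples]
    baseline = sum(rewards) / len(rewards)
    # 3) Policy optimization: REINFORCE update with a mean-reward baseline.
    grads = [0.0] * len(logits)
    for s, r in zip(samples, rewards):
        advantage = r - baseline
        for i in range(len(logits)):
            # d log pi(s) / d logit_i = 1[i == s] - pi(i)
            grads[i] += advantage * ((1.0 if i == s else 0.0) - probs[i])
    return [l + lr * g / num_samples for l, g in zip(logits, grads)]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
for _ in range(200):
    logits = on_policy_step(logits, rng)
final_probs = softmax(logits)
```

The key property is that the training data comes from the student's current policy at every step, so the teacher's feedback always targets what the student actually produces.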

Model Architecture and Training Strategy

The Qwen series was chosen for its balanced Chinese and English support and its open licensing. Training follows a curriculum learning strategy, progressing from simple to complex tasks, to keep training stable.
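
One simple way to realize a simple-to-complex curriculum is to sort problems by a difficulty score and train on them stage by stage. The sketch below assumes each problem carries a numeric `difficulty` field (for example, the number of reasoning steps in its reference solution); that field and the `curriculum_stages` helper are hypothetical and are not taken from the project.

```python
def curriculum_stages(problems, num_stages=3):
    """Split problems into stages of increasing difficulty.

    Each problem is a dict with a hypothetical numeric "difficulty" field.
    """
    if not problems:
        return []
    ordered = sorted(problems, key=lambda p: p["difficulty"])
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


problems = [
    {"id": "a", "difficulty": 5},
    {"id": "b", "difficulty": 1},
    {"id": "c", "difficulty": 3},
    {"id": "d", "difficulty": 2},
    {"id": "e", "difficulty": 8},
    {"id": "f", "difficulty": 4},
]
stages = curriculum_stages(problems)
stage_ids = [[p["id"] for p in stage] for stage in stages]
```

Training would then iterate over `stages` in order, so the model sees the easiest problems first.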

Section 05

Experimental Design and Evaluation Dimensions

The project evaluation covers four major dimensions:

Reasoning Accuracy: Accuracy on mathematical reasoning benchmarks such as GSM8K and MATH;
Generation Quality: Coherence and interpretability of the reasoning process;
Computational Efficiency: Inference speed and memory usage (adapted for edge devices);
Generalization Ability: Performance on reasoning tasks outside the training data to verify general reasoning capabilities.
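
For the reasoning-accuracy dimension, a minimal exact-match scorer for GSM8K-style data might look like the sketch below. It relies on the fact that GSM8K reference solutions end with `#### <final answer>`. Taking the last number in the model's output as its prediction is a common heuristic assumed here; the project's own answer-parsing rule is not described, and the function names are illustrative.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _to_number(text):
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def gsm8k_reference_answer(solution):
    """GSM8K reference solutions end with '#### <final answer>'."""
    return _to_number(solution.split("####")[-1].strip())


def extract_prediction(generation):
    """Assumed heuristic: the last number in the output is the final answer."""
    matches = NUMBER.findall(generation)
    return _to_number(matches[-1]) if matches else None


def exact_match_accuracy(generations, solutions):
    correct = 0
    for generation, solution in zip(generations, solutions):
        pred = extract_prediction(generation)
        if pred is not None and pred == gsm8k_reference_answer(solution):
            correct += 1
    return correct / len(solutions)


generations = [
    "She sells 16 - 3 - 4 = 9 eggs, earning 9 * 2 = 18 dollars.",
    "The answer is 7.",
]
solutions = [
    "16 - 3 - 4 = 9 eggs. 9 * 2 = 18. #### 18",
    "3 + 5 = 8 #### 8",
]
accuracy = exact_match_accuracy(generations, solutions)
```

Exact match only checks the final answer, so it says nothing about the coherence of the reasoning process, which is why generation quality is listed as a separate dimension.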

Section 06

Practical Significance and Application Prospects

This work offers a feasible path to edge-side inference, addressing the latency, privacy, and cost issues of cloud deployment. Specific application scenarios include:

Smart assistants on mobile devices (no network required);
Educational tutoring (real-time math problem solving and idea explanation);
Lightweight code assistance in development environments;
Edge real-time image reasoning and defect detection for industrial quality inspection.

Section 07

Technical Limitations and Future Directions

Limitations:

Training Stability: The online loop is sensitive to hyperparameters and can easily diverge or converge to suboptimal solutions;
Dependence on Teacher Model: The quality of distillation depends on the quality of the teacher model;
Computational Overhead: More resource-intensive than pure SFT.

Future Directions:

Introduce multi-teacher ensembling and more efficient sampling strategies, and combine the approach with RLHF to further optimize model behavior.

Section 08

Conclusion

The Qwen small model reasoning ability distillation experiment represents an important exploration direction for edge-side large model applications. Through the combination of SFT and on-policy distillation, it demonstrates a feasible path to cultivate the reasoning ability of small models under limited resources. As the demand for edge-side AI grows, such model compression and capability transfer research will become more important. We look forward to the continuous iteration of the project to provide practical experience and open-source resources for the community.
