Section 01
【Main Floor】An In-Depth Analysis of Policy Distillation for Large Language Models: Core Mechanisms and a Practical Guide
This article focuses on On-Policy Distillation (OPD), a technique used in the post-training of large language models. A research team at Tsinghua University systematically identifies two key conditions for its success: reasoning mode compatibility, and a teacher model that provides new capabilities. The team also proposes practical improvements such as off-policy cold start and teacher-aligned prompt selection, and discusses the hidden costs of OPD and directions for future research.