Section 01
【Introduction】Rethinking On-Policy Distillation for Large Language Models: Core Findings and Practical Guide
This paper systematically studies the dynamics and mechanisms of On-Policy Distillation (OPD). It identifies two conditions that determine whether OPD succeeds or fails: the student and teacher must have compatible thinking modes, and the teacher must offer capabilities the student does not already have. It shows that successful OPD is characterized by 97%-99% of the probability mass concentrating on a small token set shared by the student and teacher. It proposes two practical strategies, offline cold start and teacher-aligned prompt selection, and discusses the hidden costs of OPD along with future research directions such as long-range distillation.
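To make the "probability mass on a small shared token set" finding concrete, here is a minimal sketch of how such a quantity could be measured at a single decoding position. It rests on assumptions not stated in this summary: the shared set is taken to be the intersection of the student's and teacher's top-k tokens, and the function names (`softmax`, `top_k_ids`, `shared_mass`), the value of `k`, and the toy logits are all hypothetical. The paper's exact definition may differ.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Return the set of indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def shared_mass(student_logits, teacher_logits, k):
    """Probability mass each model places on the tokens both rank in their top-k.

    Assumption: "shared token set" = intersection of the two top-k sets.
    Returns (student_mass, teacher_mass, shared_ids).
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    shared = top_k_ids(p_student, k) & top_k_ids(p_teacher, k)
    student_mass = sum(p_student[i] for i in shared)
    teacher_mass = sum(p_teacher[i] for i in shared)
    return student_mass, teacher_mass, shared


# Toy next-token logits over a 5-token vocabulary (illustrative values only).
student_logits = [5.0, 4.0, 0.0, -1.0, -2.0]
teacher_logits = [4.5, 4.5, -1.0, 0.0, -2.0]

s_mass, t_mass, shared = shared_mass(student_logits, teacher_logits, k=3)
print(f"shared tokens: {sorted(shared)}")
print(f"student mass on shared set: {s_mass:.4f}")
print(f"teacher mass on shared set: {t_mass:.4f}")
```

In this toy case the two models agree on two of their top three tokens, and each places more than 97% of its probability on that pair. In practice one would average such a statistic over many positions and prompts rather than read it off a single step.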