Zing Forum


Rethinking On-Policy Distillation for Large Language Models: Phenomena, Mechanisms, and Practical Guide

This paper systematically studies the dynamics and mechanisms of On-Policy Distillation (OPD). It identifies two key conditions that determine whether OPD succeeds or fails, shows that successful OPD is characterized by 97%-99% of the probability mass concentrating on a small shared token set, and proposes two practical strategies: offline cold start and teacher-aligned prompt selection.

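Tags: policy distillation, knowledge distillation, large language models, post-training, token alignment, teacher selection, model optimization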
Published 2026-04-15 01:54 | Recent activity 2026-04-15 10:57 | Estimated read 7 min

Introduction. Rethinking On-Policy Distillation for Large Language Models: Core Findings and Practical Guide

This paper systematically studies the dynamics and mechanisms of On-Policy Distillation (OPD). It identifies two key conditions that determine whether OPD succeeds or fails: thinking mode compatibility, and a teacher that provides new capabilities. It shows that successful OPD is characterized by 97%-99% of the probability mass concentrating on a small shared token set, and it proposes two practical strategies: offline cold start and teacher-aligned prompt selection. It also discusses the hidden costs of OPD and future research directions such as long-range distillation.

1. On-Policy Distillation: Core Post-Training Technology and Research Background

On-Policy Distillation (OPD) is a core post-training technique for large language models. Unlike traditional Supervised Fine-Tuning (SFT), it trains on outputs generated by the student model itself, with the teacher model's evaluation of those outputs serving as the training signal. It has shown significant advantages on complex tasks such as mathematical reasoning and code generation. However, its training dynamics and internal mechanisms are not yet systematically understood. Three questions remain open: why OPD succeeds or fails, what characterizes a successful run, and how a failed run can be fixed.
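
To make the training signal concrete, here is a minimal sketch of a per-token OPD loss. The source does not state the exact objective, so this assumes a common choice: the difference between student and teacher log-probabilities at each token the student sampled, which estimates the reverse KL divergence KL(student || teacher). The function names and toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def opd_token_losses(student_logits, teacher_logits, sampled_ids):
    """Return log p_student(y_t) - log p_teacher(y_t) for each position t.

    The sequence y is sampled from the student (on-policy). Averaged over
    student samples, this quantity estimates KL(student || teacher).
    This loss form is an assumption, not taken from the paper.
    """
    losses = []
    for s_logits, t_logits, y in zip(student_logits, teacher_logits, sampled_ids):
        s_lp = log_softmax(s_logits)[y]
        t_lp = log_softmax(t_logits)[y]
        losses.append(s_lp - t_lp)
    return losses

# Toy example: vocabulary of 3 tokens, sequence of 2 positions.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
sampled = [0, 1]  # token ids drawn from the student
losses = opd_token_losses(student, teacher, sampled)
```

At the first position the two models agree, so the loss is zero. At the second position the student sampled a token the teacher finds unlikely, so the loss is positive and pushes the student away from that token.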

2. Two Key Conditions Determining OPD's Success or Failure

The study identifies two key conditions for OPD's success:

  1. Thinking Mode Compatibility: The student and teacher must share similar reasoning paths and strategies. For example, if the teacher solves problems algebraically while the student relies on enumeration, distillation is unlikely to work;
  2. Teacher Provides New Capabilities: The teacher must demonstrate problem-solving skills or reasoning patterns that the student has not yet mastered. If the teacher only repeats patterns the student already knows, OPD brings little substantial improvement.

3. Weak-to-Strong Reverse Distillation Experiment: Verifying Key Conditions

To verify these conditions, the team designed a weak-to-strong reverse distillation experiment, in which a weak 1.5B-parameter model serves as the teacher and a strong 7B-parameter model as the student. The results show that 1.5B and 7B teachers from the same family are distributionally indistinguishable to the student. Even when the 7B model is more capable, distillation is ineffective if it cannot provide capabilities the student lacks, which confirms the importance of the second condition.

4. Token-Level Micro Features of Successful OPD

At the token level, successful OPD shows two features:

  1. Progressive Alignment of High-Probability Tokens: At key positions, the student gradually comes to select the same tokens that the teacher assigns high probability to;
  2. Small Shared Token Set Phenomenon: 97%-99% of the probability mass is concentrated on a small shared token set, which shrinks the search space and focuses learning on key decision points.
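
The shared-token-set measurement can be sketched as follows. The source does not say how the shared set is constructed, so this example assumes it is the intersection of the student's and the teacher's top-k tokens at one position. The helper names, the value of k, and the toy distributions are hypothetical.

```python
def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def shared_set_mass(student_probs, teacher_probs, k):
    """Probability mass each model places on the shared top-k token set.

    The top-k intersection is an assumed definition of the shared set.
    """
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    s_mass = sum(student_probs[i] for i in shared)
    t_mass = sum(teacher_probs[i] for i in shared)
    return shared, s_mass, t_mass

# Toy next-token distributions over a 5-token vocabulary.
student_probs = [0.70, 0.25, 0.03, 0.01, 0.01]
teacher_probs = [0.60, 0.38, 0.01, 0.005, 0.005]
shared, s_mass, t_mass = shared_set_mass(student_probs, teacher_probs, k=2)
print(sorted(shared), round(s_mass, 3), round(t_mass, 3))
```

In this toy case both models put at least 95% of their mass on two shared tokens. Tracking these two numbers over training steps is one way to watch for the concentration the paper reports.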

5. Two Practical Strategies to Fix Failed OPD

Based on this understanding of the mechanism, the paper proposes two repair strategies:

  1. Offline Cold Start: First train the student on SFT data until it reaches basic competence, then start OPD. This addresses the problem of a poor initial policy;
  2. Teacher-Aligned Prompt Selection: Keep only the prompts for which the teacher can generate high-quality responses, so that the training signal is effective.
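
The two strategies above can be sketched as a small scheduling and filtering step. This is an illustration under stated assumptions: "high-quality" is approximated by the teacher's pass rate on a verifiable checker, the cold start runs for a fixed number of steps, and every name and threshold here is hypothetical rather than taken from the paper.

```python
def select_prompts(teacher_pass_rates, min_pass_rate=0.5):
    """Keep prompts whose teacher pass rate meets the threshold.

    teacher_pass_rates maps each prompt to the fraction of teacher
    samples that passed a checker (an assumed quality measure).
    """
    return [p for p, rate in teacher_pass_rates.items() if rate >= min_pass_rate]

def training_phase(step, cold_start_steps):
    """Return 'sft' during the offline cold start, then 'opd'."""
    return "sft" if step < cold_start_steps else "opd"

# Toy example: three prompts with measured teacher pass rates.
rates = {"p1": 0.9, "p2": 0.1, "p3": 0.5}
selected = select_prompts(rates, min_pass_rate=0.5)
phases = [training_phase(step, cold_start_steps=2) for step in range(4)]
print(selected, phases)
```

In practice the switch from SFT to OPD might be gated on a student metric rather than a fixed step count. The fixed count only keeps the sketch short.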

6. Hidden Costs of OPD and Practical Implications

The dense token-level rewards of OPD come at a cost: credit assignment problems, a risk of short-sighted optimization, and difficulty with long-range dependencies. Practical implications include:

  • When selecting a teacher, consider both thinking mode compatibility and whether it provides new capabilities;
  • To diagnose failures, check the overlap between student and teacher output distributions and how concentrated the probability mass is on the shared token set;
  • To improve results, use a cold start, prompt selection, and token alignment monitoring.

7. Research Limitations and Future Directions

Research Limitations: The task scope is limited to verifiable tasks such as mathematical reasoning and code generation; experiments are conducted on small and medium-sized models (1.5B-7B); effectiveness on long-range tasks is not verified; and the theoretical account needs deeper mathematical analysis.

Future Directions: Explore the application of OPD to long-range tasks, extend it to open-ended generation tasks and larger models, and deepen the theoretical understanding.