OPD: Re-examining On-Policy Distillation for Large Language Models—Phenomena, Mechanisms, and Practical Guide

A systematic study of On-Policy Distillation (OPD) from Tsinghua University's NLP Lab reveals the limitations of traditional knowledge distillation and presents a complete, practical OPD methodology.

Tags: Knowledge Distillation · Large Language Models · Model Compression · On-Policy · Tsinghua NLP · Machine Learning
Published 2026-04-30 00:35 · Recent activity 2026-04-30 00:51 · Estimated read 7 min

Section 01

OPD: Re-examining On-Policy Distillation for LLMs - Phenomena, Mechanisms & Practice Guide

This post summarizes the systematic study of On-Policy Distillation (OPD) by Tsinghua University's NLP Lab. The research reveals the limitations of traditional off-policy knowledge distillation and provides a complete OPD methodology, covering its core phenomena, underlying mechanisms, experimental validation, and actionable guidelines for model compression and deployment.

Section 02

Research Background: Challenges in LLM Distillation

As LLM parameter counts grow exponentially, deploying capable models in resource-constrained environments has become a pressing problem. Traditional knowledge distillation (KD) is off-policy: the student trains on a static dataset with teacher outputs as targets. Because the teacher's output distribution differs from what the student itself would generate, knowledge transfer is limited. On-Policy Distillation (OPD) addresses this by having the student generate its own responses and receive teacher feedback on them, yielding better alignment.
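
To make the contrast concrete, below is a minimal sketch of one on-policy distillation step in PyTorch. Toy random tensors stand in for real LLM logits, and the per-token reverse KL (common in GKD-style on-policy methods) is one plausible choice of feedback signal, not necessarily the paper's exact loss.

```python
# Minimal on-policy distillation step (sketch). Toy tensors stand in for
# real LLM logits; reverse KL is one common choice, not the paper's exact loss.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8

# In real OPD the student first samples a full response; both models then
# score that student-generated sequence. Random logits simulate this here.
student_logits = torch.randn(seq_len, vocab_size, requires_grad=True)
teacher_logits = torch.randn(seq_len, vocab_size)  # teacher is frozen

log_q = F.log_softmax(student_logits, dim=-1)  # student distribution
log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher distribution

# Per-token reverse KL(student || teacher): the student is penalized
# wherever it puts probability mass the teacher would not.
loss = (log_q.exp() * (log_q - log_p)).sum(dim=-1).mean()
loss.backward()  # gradients flow into the student only
print(f"reverse KL loss: {loss.item():.4f}")
```

The key difference from off-policy KD is not the divergence itself but where the tokens come from: in OPD the teacher provides feedback on sequences the student actually generates.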

Section 03

Core Phenomena of OPD

OPD experiments reveal key differences from off-policy distillation:

  1. Distribution Alignment: OPD achieves better behavioral alignment by letting the student explore the response space and be corrected via teacher feedback, rather than passively imitating a fixed dataset as in off-policy KD.
  2. Exploration-Exploitation Tradeoff: The student must balance exploring new responses against exploiting known good ones to avoid stagnation or instability (a temperature-sampling sketch follows this list).
  3. Differential Ability Transfer: OPD transfers some abilities efficiently (e.g., format following) but requires more elaborate strategies for others (e.g., deep reasoning).
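
As an illustration of phenomenon 2 above, sampling temperature is one simple knob for trading exploration against exploitation during student generation; the helper below is hypothetical, not part of the paper's recipe.

```python
# Illustrative only: temperature as an exploration-exploitation knob.
import torch
import torch.nn.functional as F

def sample_token(logits: torch.Tensor, temperature: float) -> int:
    """Sample one token id. Higher temperature flattens the distribution
    (more exploration); lower temperature sharpens it (more exploitation)."""
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(100)            # toy next-token logits
explore = sample_token(logits, 1.5)  # wander the response space
exploit = sample_token(logits, 0.5)  # stay near known-good tokens
```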

Section 04

Why OPD Works: Underlying Mechanisms

OPD's effectiveness stems from three key mechanisms:

  1. Countering Distribution Shift: OPD mitigates exposure bias by letting the student confront its own errors and learn corrections from teacher feedback; in off-policy KD the student never sees its own mistakes.
  2. Dynamic Curriculum Learning: OPD adapts to the student's progress, starting from simple samples and moving to complex ones, with feedback intensity adjusted dynamically.
  3. Implicit Reward Modeling: The teacher's evaluations of student responses form an automatic, low-cost implicit reward model, similar in spirit to RLHF but without human preference labels (see the sketch after this list).
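
A hedged sketch of mechanism 3: if the teacher scores the student's own tokens by log-likelihood, every generated token receives a dense reward signal without any human labels. All names and shapes here are illustrative.

```python
# Sketch: teacher log-likelihood of student tokens as an implicit reward.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
teacher_logits = torch.randn(seq_len, vocab_size)        # frozen teacher
student_tokens = torch.randint(vocab_size, (seq_len,))   # student's rollout

log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
# Per-token "reward": how plausible the teacher finds each student token.
reward = log_p_teacher.gather(-1, student_tokens.unsqueeze(-1)).squeeze(-1)
print(reward)  # one scalar per generated token, RLHF-style but label-free
```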

Section 05

OPD Practice Guide: Actionable Strategies

The OPD project provides a practical recipe:

  • Data Strategy: Use a dynamic data flow (student generation → teacher evaluation → high-quality sample selection → iterative training), prioritizing quality over quantity.
  • Training Stability: Mix off-policy and OPD losses (a loss-mixing sketch follows this list), apply temperature annealing and response truncation, and keep the teacher consistent across iterations.
  • Efficiency Optimization: Cache common teacher responses, parallelize student generation and teacher evaluation, and use small-batch updates.
  • Evaluation: Track how generation quality evolves with dynamic metrics, not just static test-set performance.
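
One way to realize the loss-mixing tip above is a weighted sum whose off-policy weight is annealed toward zero over training; the linear schedule and starting weight below are illustrative assumptions, not values from the paper.

```python
# Sketch: blend off-policy cross-entropy with on-policy reverse KL,
# annealing from mostly off-policy to mostly on-policy for stability.
import torch

def mixed_distill_loss(ce_offpolicy: torch.Tensor,
                       rkl_onpolicy: torch.Tensor,
                       step: int, total_steps: int,
                       alpha_start: float = 0.8) -> torch.Tensor:
    alpha = alpha_start * max(0.0, 1.0 - step / total_steps)
    return alpha * ce_offpolicy + (1.0 - alpha) * rkl_onpolicy

# e.g. early in training (step 100 of 1000) the off-policy term dominates:
loss = mixed_distill_loss(torch.tensor(2.0), torch.tensor(0.5), 100, 1000)
```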

Section 06

Experimental Results: OPD's Performance Advantages

OPD outperforms off-policy distillation in multiple benchmarks:

  • Instruction Following: On AlpacaEval/MT-Bench, OPD-trained students outperform same-scale off-policy-distilled models and approach the teacher's level.
  • Knowledge-Intensive Tasks: Better knowledge retention on TriviaQA/Natural Questions, indicating effective transfer of factual knowledge.
  • Reasoning: Stronger performance on GSM8K math reasoning, suggesting the student better learns the teacher's reasoning chains.

Section 07

Limitations, Open Questions & Industry Impact

Limitations: Higher computational cost (the teacher must participate online during training), sensitive hyperparameters, and challenges with long sequences and multi-turn dialogues.

Industry Implications:

  • For model vendors: OPD offers a path to cutting deployment costs while preserving capability, potentially enabling teacher APIs and distillation-as-a-service.
  • For enterprises: Customize small models on private data via OPD, balancing privacy and performance.
  • For researchers: OPD sits at the intersection of KD and RL, opening new research directions.

Section 08

Conclusion: OPD's Role in LLM Deployment

OPD marks a shift in knowledge distillation from empirical craft toward a scientific method, answering not only whether OPD is better but also how to apply it well. As LLM deployment costs draw more attention, OPD is positioned to play a key role in model compression and edge deployment. Understanding its principles and practices is valuable for engineers and researchers in this field.