OPD: Re-examining On-Policy Distillation for Large Language Models—Phenomena, Mechanisms, and Practical Guide

A systematic study of On-Policy Distillation (OPD) from Tsinghua University's NLP Lab reveals the limitations of traditional knowledge distillation and presents a complete, practical OPD methodology.

Tags: Knowledge Distillation · Large Language Models · Model Compression · On-Policy · Tsinghua NLP · Machine Learning
Published 2026-04-30 00:35 · Recent activity 2026-04-30 00:51 · Estimated read 7 min

Section 01

OPD: Re-examining On-Policy Distillation for LLMs - Phenomena, Mechanisms & Practice Guide

This post summarizes the systematic study of On-Policy Distillation (OPD) by Tsinghua University's NLP Lab. The research reveals the limitations of traditional off-policy knowledge distillation and provides a complete OPD methodology, covering its core phenomena, underlying mechanisms, experimental validation, and actionable guidelines for model compression and deployment.

Section 02

Research Background: Challenges in LLM Distillation

As LLM parameter counts grow exponentially, deploying capable models in resource-constrained environments has become a pressing problem. Traditional knowledge distillation (KD) is off-policy: the student trains on a static dataset with teacher outputs as targets. Because the teacher's output distribution differs from what the student itself would generate, knowledge transfer is limited. On-Policy Distillation (OPD) addresses this by having the student generate its own responses and receive teacher feedback on them, yielding better alignment.
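
To make the contrast concrete, below is a minimal sketch of one on-policy distillation step in PyTorch. Toy random tensors stand in for real LLM logits, and the per-token reverse KL (common in GKD-style on-policy methods) is one plausible choice of feedback signal, not necessarily the paper's exact loss.

```python
# Minimal on-policy distillation step (sketch). Toy tensors stand in for
# real LLM logits; reverse KL is one common choice, not the paper's exact loss.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8

# In real OPD the student first samples a full response; both models then
# score that student-generated sequence. Random logits simulate this here.
student_logits = torch.randn(seq_len, vocab_size, requires_grad=True)
teacher_logits = torch.randn(seq_len, vocab_size)  # teacher is frozen

log_q = F.log_softmax(student_logits, dim=-1)  # student distribution
log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher distribution

# Per-token reverse KL(student || teacher): the student is penalized
# wherever it puts probability mass the teacher would not.
loss = (log_q.exp() * (log_q - log_p)).sum(dim=-1).mean()
loss.backward()  # gradients flow into the student only
print(f"reverse KL loss: {loss.item():.4f}")
```

The key difference from off-policy KD is not the divergence itself but where the tokens come from: in OPD the teacher provides feedback on sequences the student actually generates.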

Section 03

Core Phenomena of OPD

OPD experiments reveal key differences from off-policy distillation:

  1. Distribution Alignment: OPD achieves better behavioral alignment by letting the student explore the response space and be corrected via teacher feedback, rather than passively imitating a fixed dataset as in off-policy KD.
  2. Exploration-Exploitation Tradeoff: The student must balance exploring new responses against exploiting known good ones to avoid stagnation or instability (a temperature-sampling sketch follows this list).
  3. Differential Ability Transfer: OPD transfers some abilities efficiently (e.g., format following) but requires more elaborate strategies for others (e.g., deep reasoning).
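
As an illustration of phenomenon 2 above, sampling temperature is one simple knob for trading exploration against exploitation during student generation; the helper below is hypothetical, not part of the paper's recipe.

```python
# Illustrative only: temperature as an exploration-exploitation knob.
import torch
import torch.nn.functional as F

def sample_token(logits: torch.Tensor, temperature: float) -> int:
    """Sample one token id. Higher temperature flattens the distribution
    (more exploration); lower temperature sharpens it (more exploitation)."""
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(100)            # toy next-token logits
explore = sample_token(logits, 1.5)  # wander the response space
exploit = sample_token(logits, 0.5)  # stay near known-good tokens
```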

Section 04

Why OPD Works: Underlying Mechanisms

OPD's effectiveness stems from three key mechanisms:

  1. Countering Distribution Shift: OPD mitigates exposure bias by letting the student confront its own errors and learn corrections from teacher feedback; in off-policy KD the student never sees its own mistakes.
  2. Dynamic Curriculum Learning: OPD adapts to the student's progress, starting from simple samples and moving to complex ones, with feedback intensity adjusted dynamically.
  3. Implicit Reward Modeling: The teacher's evaluations of student responses form an automatic, low-cost implicit reward model, similar in spirit to RLHF but without human preference labels (see the sketch after this list).
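
A hedged sketch of mechanism 3: if the teacher scores the student's own tokens by log-likelihood, every generated token receives a dense reward signal without any human labels. All names and shapes here are illustrative.

```python
# Sketch: teacher log-likelihood of student tokens as an implicit reward.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
teacher_logits = torch.randn(seq_len, vocab_size)        # frozen teacher
student_tokens = torch.randint(vocab_size, (seq_len,))   # student's rollout

log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
# Per-token "reward": how plausible the teacher finds each student token.
reward = log_p_teacher.gather(-1, student_tokens.unsqueeze(-1)).squeeze(-1)
print(reward)  # one scalar per generated token, RLHF-style but label-free
```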

Section 05

OPD Practice Guide: Actionable Strategies

The OPD project provides a practical recipe:

  • Data Strategy: Use a dynamic data flow (student generation → teacher evaluation → high-quality sample selection → iterative training), prioritizing quality over quantity.
  • Training Stability: Mix off-policy and OPD losses (a loss-mixing sketch follows this list), apply temperature annealing and response truncation, and keep the teacher consistent across iterations.
  • Efficiency Optimization: Cache common teacher responses, parallelize student generation and teacher evaluation, and use small-batch updates.
  • Evaluation: Track how generation quality evolves with dynamic metrics, not just static test-set performance.
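
One way to realize the loss-mixing tip above is a weighted sum whose off-policy weight is annealed toward zero over training; the linear schedule and starting weight below are illustrative assumptions, not values from the paper.

```python
# Sketch: blend off-policy cross-entropy with on-policy reverse KL,
# annealing from mostly off-policy to mostly on-policy for stability.
import torch

def mixed_distill_loss(ce_offpolicy: torch.Tensor,
                       rkl_onpolicy: torch.Tensor,
                       step: int, total_steps: int,
                       alpha_start: float = 0.8) -> torch.Tensor:
    alpha = alpha_start * max(0.0, 1.0 - step / total_steps)
    return alpha * ce_offpolicy + (1.0 - alpha) * rkl_onpolicy

# e.g. early in training (step 100 of 1000) the off-policy term dominates:
loss = mixed_distill_loss(torch.tensor(2.0), torch.tensor(0.5), 100, 1000)
```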

Section 06

Experimental Results: OPD's Performance Advantages

OPD outperforms off-policy distillation in multiple benchmarks:

  • Instruction Following: On AlpacaEval/MT-Bench, OPD-trained students outperform same-scale off-policy-distilled models and approach the teacher's level.
  • Knowledge-Intensive Tasks: Better knowledge retention on TriviaQA/Natural Questions, indicating effective transfer of factual knowledge.
  • Reasoning: Stronger performance on GSM8K math reasoning, suggesting the student better learns the teacher's reasoning chains.

Section 07

Limitations, Open Questions & Industry Impact

Limitations: Higher computational cost (the teacher must participate online during training), sensitive hyperparameters, and challenges with long sequences and multi-turn dialogues.

Industry Implications:

  • For model vendors: OPD offers a path to cutting deployment costs while preserving capability, potentially enabling teacher APIs and distillation-as-a-service.
  • For enterprises: Customize small models on private data via OPD, balancing privacy and performance.
  • For researchers: OPD sits at the intersection of KD and RL, opening new research directions.

Section 08

Conclusion: OPD's Role in LLM Deployment

OPD marks a shift in knowledge distillation from empirical craft toward a scientific method, answering not only whether OPD is better but also how to apply it well. As LLM deployment costs draw more attention, OPD is positioned to play a key role in model compression and edge deployment. Understanding its principles and practices is valuable for engineers and researchers in this field.