Zing Forum

New Paradigm for Large Language Model Training: A Detailed Explanation of On-Policy Distillation Technology and Its Cutting-Edge Developments

This article delves into On-Policy Distillation (OPD) for large language models, analyzing its advantages over traditional off-policy distillation and the mechanisms by which it addresses exposure bias and error accumulation.

Tags: Large Language Models · On-Policy Distillation · AI Training · Knowledge Distillation · Exposure Bias · Machine Learning · AI Optimization
Published 2026-05-08 23:01 · Recent activity 2026-05-08 23:07 · Estimated read 7 min

Section 01

Main Floor: New Paradigm for Large Language Model Training — Core Analysis of On-Policy Distillation Technology

This article focuses on On-Policy Distillation (OPD) for large language model training, analyzing its advantages over traditional off-policy distillation (e.g., SFT). It highlights the mechanisms that address exposure bias, error accumulation, and train-test mismatch, and surveys the technology's current status, application prospects, and significance for the AI industry.

Section 02

Background: Limitations of Traditional Off-Policy Distillation Methods

Traditional off-policy distillation (e.g., SFT) suffers from exposure bias: during training, the student model predicts the next token conditioned on a perfect teacher prefix, but at inference it must condition on its own, potentially flawed, generations, creating a training-inference mismatch. Errors accumulate and amplify over long sequences; with the rise of reasoning models (System 2 thinking) from 2024 to 2026, long chain-of-thought generation makes this error accumulation more severe, and traditional SFT can no longer meet the needs of extended reasoning, which has driven the development of OPD. The sketch below illustrates the mismatch.
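A minimal PyTorch sketch of the mismatch, using a hypothetical toy model (TinyLM and all sizes here are illustrative, not taken from any OPD paper): the training loss only ever sees gold prefixes, while generation feeds the model its own outputs.

```python
import torch
import torch.nn.functional as F

class TinyLM(torch.nn.Module):
    """A toy autoregressive LM; stands in for any student model."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.rnn = torch.nn.GRU(dim, dim, batch_first=True)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, tokens):                  # tokens: (B, T)
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)                     # logits: (B, T, vocab)

model = TinyLM()
gold = torch.randint(0, 100, (1, 16))           # a "perfect" reference sequence

# Training (off-policy / teacher forcing): every step conditions on the gold prefix.
logits = model(gold[:, :-1])
tf_loss = F.cross_entropy(logits.reshape(-1, 100), gold[:, 1:].reshape(-1))

# Inference (free running): every step conditions on the model's OWN prefix,
# a distribution it never saw during training -- early mistakes feed later steps.
seq = gold[:, :1]
for _ in range(15):
    nxt = model(seq)[:, -1].argmax(-1, keepdim=True)
    seq = torch.cat([seq, nxt], dim=1)
```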

Section 03

Methodology: Core Mechanisms of OPD Technology

The core of OPD is to let the student model generate trajectories from its own distribution, then use a teacher or reward model to evaluate those trajectories and correct errors. Classified by signal source, methods fall into three families: white-box, with access to the teacher's full logits (e.g., GKD, MiniLLM; ULD/DSKD handle mismatched tokenizers); black-box, with access only to API outputs (e.g., Lion, GAD); and self-distillation, with no teacher at all (validator/reward-model methods like SDPO, privileged-information methods like OPSD, and pure self-iteration methods like SPIN). Objective functions include fixed divergences (forward/reverse KL, JSD), adaptive divergences (AKL, ToDi), and reinforcement-learning enhancements (G-OPD, RLAD). A sketch of the white-box loop follows.
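A hedged sketch of one white-box OPD step, written against Hugging Face-style model handles (student.generate, .logits); the reverse-KL choice, the sampling settings, and the omission of prompt masking are simplifications for illustration, not any specific paper's recipe:

```python
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, optimizer, max_new=64):
    # 1) On-policy: the student samples trajectories from its own distribution.
    with torch.no_grad():
        traj = student.generate(prompt_ids, max_new_tokens=max_new,
                                do_sample=True)

    # 2) Both models score the SAME student-generated sequence.
    s_logits = student(traj).logits              # (B, T, V)
    with torch.no_grad():
        t_logits = teacher(traj).logits          # white-box: logit access required

    # 3) Per-token reverse KL(student || teacher) -- mode-seeking, so the
    #    student concentrates on behavior the teacher also endorses.
    #    (Prompt-token masking and length normalization omitted for brevity.)
    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)
    rkl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)  # (B, T)
    loss = rkl.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```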

Section 04

Cutting-Edge: Latest Development Trends of OPD Technology

Recent trends in the OPD field include: 1. Shifting from fixed reverse KL to adaptive switching (e.g., AKL, token-level gating) to balance exploration and guidance; 2. The rise of self-distillation (e.g., SDPO, SDZero), which increasingly dominates the field; 3. The finding that only 20-50% of tokens, the high-entropy or high-divergence ones, need the distillation loss applied (see the gating sketch after this list); 4. Agentic OPD (e.g., TCOD, Skill-SD) to curb error accumulation in multi-round tool use; 5. Industrial adoption (models like DeepSeek-V4 and Qwen3 incorporate OPD into their training pipelines); 6. A persistent diversity-collapse issue (Pass@1 improves while Pass@k decreases).
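An illustrative sketch of trend 3, token-level gating: keep the distillation loss only on the highest-entropy student tokens. The forward-KL choice and the 30% keep-ratio are assumptions for illustration (the post only states a 20-50% range):

```python
import torch
import torch.nn.functional as F

def gated_distill_loss(s_logits, t_logits, keep_ratio=0.3):
    s_logp = F.log_softmax(s_logits, dim=-1)     # (B, T, V)
    t_logp = F.log_softmax(t_logits, dim=-1)
    probs = s_logp.exp()

    # Per-token student entropy: high entropy = the student is uncertain there.
    entropy = -(probs * s_logp).sum(-1)          # (B, T)

    # Keep only the top `keep_ratio` fraction of tokens in each sequence.
    k = max(1, int(entropy.size(1) * keep_ratio))
    _, idx = entropy.topk(k, dim=1)
    mask = torch.zeros_like(entropy).scatter_(1, idx, 1.0)

    # Distillation loss (here forward KL) on the selected tokens only.
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(-1)   # (B, T)
    return (kl * mask).sum() / mask.sum()
```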

Section 05

Guide: Practical Recommendations for OPD Technology Selection

Practitioners can refer to the following when choosing an OPD method: 1. Teacher access: full logits → white-box (GKD etc. for a shared tokenizer, ULD etc. for mismatched ones); API outputs only → black-box (Lion etc.); no teacher → self-distillation (SDPO etc. with a validator, OPSD etc. with privileged context, SPIN etc. for pure self-iteration). 2. Objective function: fixed KL/JSD for stable benchmarks, AKL etc. for adaptive cases, G-OPD etc. for reward shaping. 3. Training issues: dynamic toolkits like TIP/SCOPE for instability or inefficiency. The helper sketch after this paragraph restates the access decision.
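A plain-Python restatement of decision 1, purely as a reading aid; the method names come from this post, and the boolean flags are hypothetical inputs:

```python
def pick_opd_family(has_teacher_logits: bool, same_tokenizer: bool,
                    api_outputs_only: bool, has_validator: bool,
                    has_privileged_context: bool) -> str:
    """Map teacher-access conditions to the OPD family named in the guide."""
    if has_teacher_logits:
        return "white-box: GKD-style" if same_tokenizer else "white-box: ULD/DSKD-style"
    if api_outputs_only:
        return "black-box: Lion/GAD-style"
    if has_validator:
        return "self-distillation: SDPO-style (validator/reward model)"
    if has_privileged_context:
        return "self-distillation: OPSD-style (privileged information)"
    return "self-distillation: SPIN-style (pure self-iteration)"
```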

Section 06

Applications: Industrial Practices of OPD Technology

OPD is already applied widely to mathematical reasoning, code generation, and complex reasoning tasks. For mathematical reasoning, one can follow the OPSD→RLKD→SCOPE path; for multi-round agent construction, consider TCOD (temporal curriculum) or Skill-SD (skill-conditional self-distillation); cutting-edge models such as DeepSeek-V4, Qwen3, and Nemotron have integrated OPD into their training pipelines.

Section 07

Challenges: Open Issues Faced by OPD Technology

OPD still faces open challenges: 1. High computational cost (on-policy training requires extra sampling and teacher-evaluation steps); 2. Training instability (fine-grained hyperparameter tuning is needed); 3. A lack of standardized evaluation benchmarks; 4. The accuracy-diversity trade-off (the diversity-collapse issue, where Pass@1 rises but Pass@k falls, remains unsolved).

Section 08

Conclusion: Value of OPD Technology and Opportunities for China's AI Industry

OPD is an important advance in LLM training: it addresses fundamental limitations of traditional methods and opens the way to more reliable AI systems. For China's AI industry, OPD offers fresh ideas for optimizing Chinese large models, and domestic institutions can develop adaptive methods tailored to Chinese-language data and use cases. OPD is set to play a key role in future AI development, and researchers and engineers who master it will stay competitive.