
OPSD: A New On-Policy Self-Distillation Training Method for Large Language Models

OPSD (On-Policy Self-Distillation) is an innovative training method for large language models. It achieves token-level reasoning optimization through an on-policy self-distillation mechanism, significantly improving model performance while maintaining computational efficiency.

Tags: Large Language Models · Knowledge Distillation · Self-Distillation · Online Learning · Token-Level Optimization · Model Training · Machine Learning · Reasoning Ability
Published 2026-04-28 12:15 · Recent activity 2026-04-28 12:18 · Estimated read 7 min

Section 01

[Introduction] OPSD: Core Analysis of a New On-Policy Self-Distillation Method for Large Language Models

OPSD (On-Policy Self-Distillation) is an innovative training method for large language models. Its core mechanism is on-policy self-distillation, which achieves token-level optimization of reasoning. The method requires no separate teacher model; instead, the model's current policy generates the soft targets it learns from. While maintaining computational efficiency, it significantly improves reasoning ability, data efficiency, and generalization, offering an efficient option for resource-constrained or annotation-scarce scenarios.


Section 02

Background and Challenges: Existing Pain Points in LLM Training

In large language model training, traditional Supervised Fine-Tuning (SFT) underperforms on complex reasoning tasks. The main challenges: high-quality annotated data is costly to acquire; conventional distillation requires training a separate teacher model first, adding complexity; and fine-grained, token-level optimization of reasoning remains an open problem. These issues have spurred demand for new training paradigms.


Section 03

Core of OPSD Method: On-Policy Self-Distillation and Token-Level Optimization

The core idea of OPSD is that the model acts as its own teacher, distilling from target distributions it generates online. Key innovations include:

  1. Token-level reasoning optimization: Fine-grained supervision at every generation step, using soft targets (full probability distributions) instead of hard labels to obtain richer gradient signals (see the sketch after this list);
  2. On-policy learning: Samples are generated with the current policy, so training tracks the model's own learning progress, reduces dependence on external data, and balances exploration and exploitation;
  3. Self-distillation framework: No separate teacher model is needed, which cuts computational overhead and enables more efficient knowledge transfer, with sampling noise acting as a regularizer against overfitting.
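
To make the soft-target idea concrete, here is a minimal PyTorch sketch (illustrative only; the tensor shapes and variable names are assumptions, not the paper's code) contrasting hard-label cross-entropy with token-level KL against a full distribution:

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: 8 token positions, a 32k vocabulary.
# `target_probs` stands in for the soft targets the model recorded
# for its own generations; it carries no gradient, so the "teacher"
# side of the distillation is fixed within the step.
vocab_size = 32_000
student_logits = torch.randn(8, vocab_size, requires_grad=True)
target_probs = torch.softmax(torch.randn(8, vocab_size), dim=-1)

# A hard label keeps only the single most likely token per position...
hard_labels = target_probs.argmax(dim=-1)
ce_loss = F.cross_entropy(student_logits, hard_labels)

# ...whereas token-level KL matches the full distribution, so every
# candidate token at every step contributes to the gradient signal.
kl_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    target_probs,
    reduction="batchmean",
)
```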

Section 04

OPSD Training Process and Implementation Details

The training process consists of four steps:

  1. Forward generation: Generate responses from input prompts and record the probability distribution at each position;
  2. Target construction: Use the recorded distributions as the soft targets;
  3. Backward optimization: Minimize the KL divergence between the model's predictions and the soft targets to update the parameters;
  4. Iterative loop: Repeat the above steps for continuous improvement.

In implementation, gradient clipping and learning-rate scheduling are combined to keep training stable, and a temperature parameter adjusts the sharpness of the probability distributions; a minimal sketch follows.
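
Here is a minimal sketch of one such iteration, assuming a Hugging Face-style causal LM in PyTorch (the function name `opsd_step`, the shapes, and the position-alignment logic are illustrative assumptions, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F

def opsd_step(model, tokenizer, prompts, optimizer, scheduler,
              temperature=1.0, max_new_tokens=128, clip_norm=1.0):
    """One hypothetical OPSD iteration following the four steps above."""
    # Step 1 -- forward generation: sample with the current policy and
    # keep the per-position logits (`scores`).
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=True, output_scores=True,
                             return_dict_in_generate=True)
    gen_logits = torch.stack(out.scores, dim=1)          # [B, T, V]

    # Step 2 -- target construction: temperature-scaled distributions
    # over the sampled tokens become the (detached) soft targets.
    targets = F.softmax(gen_logits / temperature, dim=-1)

    # Step 3 -- backward optimization: re-run the sequence with
    # gradients enabled and pull the student's predictions toward the
    # soft targets via KL divergence. (Simplified alignment: assumes
    # left-padded prompts and no early stopping.)
    T = gen_logits.size(1)
    student = model(out.sequences).logits[:, -T - 1:-1, :]
    loss = F.kl_div(F.log_softmax(student / temperature, dim=-1),
                    targets, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    scheduler.step()  # learning-rate scheduling, as noted above

    # Step 4 -- iterate: the caller repeats this over fresh prompts.
    return loss.item()
```

Note that the soft targets are produced under `torch.no_grad()`, so the "teacher" side of the self-distillation receives no gradient; only the re-scored student pass is optimized.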

Section 05

Performance Advantages and Application Scenarios of OPSD

Advantages:

  • Computational efficiency: No separate teacher model, reducing memory and compute overhead;
  • Reasoning ability: Token-level optimization improves multi-step reasoning (e.g., math, code generation);
  • Data efficiency: Self-distillation reduces dependence on large-scale annotated data;
  • Generalization: The on-policy scheme adapts to new data distributions.

Application scenarios: resource-constrained environments, annotation-scarce fields such as medicine and law, and improving existing models.

Section 06

Limitations of OPSD and Future Research Directions

Limitations: Low-quality samples early in training can cause errors to accumulate, and the later stages of training are prone to local optima. Future directions: Introduce curriculum learning to gradually increase sample difficulty (a toy sketch follows); combine offline pre-training with online on-policy fine-tuning; and explore multi-model collaborative self-distillation frameworks.
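
For instance, the curriculum-learning direction could be as simple as widening a difficulty-ordered sampling pool over time. The following toy sketch is entirely hypothetical; the `difficulty` scores and the 20%-to-100% schedule are placeholders, not anything proposed in the source:

```python
def curriculum_pool(prompts, difficulty, step, total_steps):
    """Hypothetical schedule: start with the easiest 20% of prompts
    and grow the pool linearly until all prompts are in play."""
    ranked = [p for _, p in sorted(zip(difficulty, prompts),
                                   key=lambda pair: pair[0])]
    frac = min(1.0, 0.2 + 0.8 * step / total_steps)
    return ranked[: max(1, int(len(ranked) * frac))]
```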


Section 07

Summary and Outlook: The Significance of OPSD for LLM Training

OPSD balances computational efficiency, reasoning ability, and data efficiency, giving researchers and practitioners a practical option for resource-constrained scenarios. Its ideas of self-teaching and fine-grained optimization are likely to play a larger role in future LLM training, and they offer a useful reference point for balancing efficiency and performance in AI systems.