Section 01
[Introduction] OPSD: An Analysis of a New On-Policy Self-Distillation Method for Large Language Models
OPSD (On-Policy Self-Distillation) is a training method for large language models whose core mechanism is on-policy self-distillation for token-level optimization of reasoning. The method requires no separate teacher model: the model's current policy generates soft targets from which the model itself learns. It improves reasoning ability, data efficiency, and generalization while remaining computationally efficient, making it well suited to resource-constrained or annotation-scarce settings.
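The loop described above can be sketched in PyTorch. This is a minimal illustration, not OPSD's actual recipe: `TinyLM` is a toy stand-in for an LLM, and the assumption that soft targets come from a stop-gradient pass of the same model with a privileged `hint` prepended is ours; the source only states that the current policy produces the soft targets.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 32, 16                                    # toy vocab / hidden size

class TinyLM(torch.nn.Module):
    """Toy next-token model standing in for an LLM (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(V, D)
        self.head = torch.nn.Linear(D, V)
    def forward(self, ids):                      # ids: (batch, seq)
        return self.head(self.emb(ids))          # logits: (batch, seq, V)

model = TinyLM()
opt = torch.optim.SGD(model.parameters(), lr=0.05)

prompt = torch.randint(0, V, (2, 4))             # hypothetical prompts
hint = torch.randint(0, V, (2, 2))               # assumed privileged context

# 1) on-policy rollout: sample a short continuation from the current policy
cont = torch.empty((2, 0), dtype=torch.long)
with torch.no_grad():
    for _ in range(3):
        ids = torch.cat([prompt, cont], dim=1)
        nxt = torch.distributions.Categorical(
            logits=model(ids)[:, -1, :]).sample()
        cont = torch.cat([cont, nxt.unsqueeze(1)], dim=1)
seq = torch.cat([prompt, cont], dim=1)
p, c = prompt.size(1), cont.size(1)

# 2) soft targets from the model itself: a stop-gradient pass with the
#    privileged hint prepended (our assumption of how targets are formed)
with torch.no_grad():
    t_logits = model(torch.cat([hint, seq], dim=1))
    teacher_logp = F.log_softmax(
        t_logits[:, hint.size(1) + p - 1 : -1, :], dim=-1)

# 3) token-level distillation: per-token KL(teacher || student) at the
#    positions that predict the sampled continuation tokens
student_logp = F.log_softmax(model(seq)[:, p - 1 : -1, :], dim=-1)
loss = F.kl_div(student_logp, teacher_logp, log_target=True,
                reduction="batchmean")
loss.backward()
opt.step()
```

Because the rollout is sampled from the same policy being updated, the distillation signal stays on-policy: the model is corrected on exactly the tokens it would actually generate, rather than on a fixed reference corpus.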