Section 01
[Introduction] Prune-OPD: An Efficient and Reliable Solution for Policy Distillation in Long-Range Reasoning
This paper proposes the Prune-OPD framework, which addresses the prefix drift problem in policy distillation for long-range reasoning tasks. By dynamically monitoring the local consistency between student and teacher predictions, the framework reduces training time by 37.6%–68.0% while maintaining or even improving model performance, offering an efficient and reliable strategy for training long-range reasoning models.
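The summary above does not specify how local consistency is measured or how pruning decisions are made, so the following is only a minimal illustrative sketch of the general idea: track windowed agreement between student and teacher next-token predictions and identify a point after which teacher supervision could be skipped. The function names (`local_consistency`, `prune_point`), the top-1 agreement metric, and the `window`/`threshold` parameters are all hypothetical choices, not the paper's actual method.

```python
import torch
import torch.nn.functional as F


def local_consistency(student_logits, teacher_logits, window=8):
    """Windowed agreement between student and teacher next-token predictions.

    student_logits, teacher_logits: [seq_len, vocab] tensors.
    Returns a [seq_len] tensor with the mean top-1 agreement over the
    preceding `window` positions (an assumed proxy for local consistency).
    """
    # 1.0 where both models predict the same next token, else 0.0.
    agree = (student_logits.argmax(-1) == teacher_logits.argmax(-1)).float()
    # Left-pad so each position sees only itself and earlier positions,
    # then take a sliding-window mean via 1D average pooling.
    padded = F.pad(agree.view(1, 1, -1), (window - 1, 0), value=0.0)
    return F.avg_pool1d(padded, kernel_size=window, stride=1).view(-1)


def prune_point(student_logits, teacher_logits, threshold=0.9, window=8):
    """Earliest position after which windowed consistency stays above
    `threshold`; teacher queries beyond that point could be pruned."""
    score = local_consistency(student_logits, teacher_logits, window)
    ok = score >= threshold
    for t in range(len(ok)):
        if ok[t:].all():
            return t
    return len(ok)  # no stable agreement found: keep full teacher supervision


if __name__ == "__main__":
    # Toy usage: random logits for a 64-token sequence over a 100-word vocab.
    torch.manual_seed(0)
    teacher = torch.randn(64, 100)
    student = teacher + 0.1 * torch.randn(64, 100)  # student close to teacher
    print("prune point:", prune_point(student, teacher))
```

In this sketch, positions after the returned index would be trained without per-token teacher calls; how Prune-OPD actually realizes the saved computation (and the reported 37.6%–68.0% reduction) is detailed in the paper itself.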