Section 01
[Overview] Unveiling the Truth of Policy Distillation: When It Works, When It Harms, and Why
The core research in this article proposes a training-free token-wise diagnostic framework. Through detailed analysis, it finds: policy distillation is more helpful on incorrect reasoning paths, while teacher signals may introduce noise on correct paths. This framework can quantify the alignment between distillation configurations and ideal gradients, providing key guidance for training reasoning models, explaining the unstable effects of distillation, and offering optimization directions.