Section 01
[Introduction] The True Role of OPSD in Reasoning Models: Compression Tool Rather Than Correction Tool
This article argues that the core role of OPSD (On-Policy Self-Distillation) in chain-of-thought reasoning is compression rather than correction. On mathematical reasoning tasks, applying OPSD only to correct reasoning trajectories preserves accuracy while substantially shortening output length, whereas applying it to incorrect trajectories degrades performance. Building on this observation, the paper proposes a new post-training pipeline, SFT → RLVR → OPSD, in which each stage plays a distinct role in achieving efficient reasoning.
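The selection step implied above, keeping only verified-correct trajectories as distillation targets, can be sketched as follows. This is an illustrative sketch, not the paper's implementation; `Rollout`, `is_correct`, and `select_for_distillation` are hypothetical names, and the correctness check stands in for whatever verifier the RLVR stage uses.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str
    chain_of_thought: str  # model-generated reasoning trace
    answer: str            # final answer extracted from the trace

def is_correct(rollout: Rollout, gold: dict) -> bool:
    # Stand-in for an RLVR-style verifiable check (e.g., exact-match
    # against a known mathematical answer).
    return gold.get(rollout.prompt) == rollout.answer

def select_for_distillation(rollouts: list, gold: dict) -> list:
    # Per the article's finding, incorrect trajectories are discarded,
    # not corrected: only correct ones become self-distillation targets.
    return [r for r in rollouts if is_correct(r, gold)]

gold = {"2+2": "4"}
rollouts = [
    Rollout("2+2", "2 plus 2 equals 4.", "4"),
    Rollout("2+2", "2 plus 2 equals 5.", "5"),
]
kept = select_for_distillation(rollouts, gold)
```

Under this scheme, OPSD then trains the model to reproduce (a compressed form of) `kept` trajectories, which is why it shortens outputs without fixing wrong answers.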