Section 01
[Introduction] StableOPD: A New Framework to Address Length Inflation in Online Policy Distillation for Large Models
The research team identifies two failure modes in Online Policy Distillation (OPD) training, length inflation and truncation collapse, and proposes StableOPD, a framework that combines a reference-divergence constraint with mixed-rollout distillation. The approach improves training stability and delivers an average performance gain of 7.2% across multiple datasets.
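The summary names the two mechanisms but gives no equations, so the following is a minimal PyTorch sketch of how such an objective might be assembled. The loss form, the coefficient `ref_kl_coef`, and the `mix_ratio` hyperparameter are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def stable_opd_loss(student_logits, teacher_logits, ref_logits,
                    ref_kl_coef=0.1):
    """Hypothetical StableOPD-style loss over [batch, seq, vocab] logits.

    - The distillation term pulls the student toward the teacher.
    - The reference-divergence term penalizes drift from a frozen
      reference policy (an assumed mechanism for curbing length
      inflation; the paper's exact formulation may differ).
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    p_ref = F.softmax(ref_logits, dim=-1)

    # Distillation objective: KL(teacher || student), averaged over tokens.
    distill_kl = (p_teacher * (p_teacher.clamp_min(1e-9).log()
                               - log_p_student)).sum(-1).mean()

    # Reference-divergence constraint: KL(student || reference).
    p_student = log_p_student.exp()
    ref_kl = (p_student * (log_p_student
                           - p_ref.clamp_min(1e-9).log())).sum(-1).mean()

    return distill_kl + ref_kl_coef * ref_kl


def mixed_rollouts(student_batch, teacher_batch, mix_ratio=0.5):
    """Mix on-policy (student) and teacher rollouts into one batch.

    mix_ratio is the fraction of samples drawn from student rollouts;
    the remainder come from teacher rollouts. The ratio is a
    hypothetical hyperparameter chosen for illustration.
    """
    n_student = int(round(len(student_batch) * mix_ratio))
    n_teacher = len(student_batch) - n_student
    return student_batch[:n_student] + teacher_batch[:n_teacher]
```

The intuition behind this sketch: pure on-policy distillation can reward ever-longer generations until responses hit the context limit (truncation collapse), while the reference-KL term and the injection of teacher rollouts both anchor the student's output distribution, which is consistent with the stability claims in the summary.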