Section 01
MCPO: Enhancing Large Model Reasoning Ability via Mastery Consolidation and Optimization
To address GRPO's weak training signal on mastered prompts (accuracy near 100%, where group-relative advantages vanish) and mostly correct prompts (accuracy between 50% and 100%), this paper proposes the MCPO framework. Its core innovations are a hinge KL regularizer, which constrains policy drift on mastered prompts, and a weighting mechanism that emphasizes mostly correct prompts. MCPO yields sustained pass@1 gains on mathematical reasoning benchmarks and, unexpectedly, also improves pass@k, indicating better output diversity.
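The two mechanisms can be illustrated with a minimal sketch. The function names, the drift tolerance `delta`, and the exact weighting formula below are all hypothetical placeholders, not the paper's actual formulation; they only show the shape of the idea, namely a hinge that activates the KL penalty only past a tolerance, and a weight that upweights prompts in the 50%-100% accuracy band.

```python
def hinge_kl(kl: float, delta: float = 0.05) -> float:
    """Hinge KL regularizer (illustrative): penalize policy drift
    only once the per-prompt KL exceeds the tolerance delta.
    Within the tolerance, the penalty is exactly zero."""
    return max(0.0, kl - delta)

def prompt_weight(accuracy: float) -> float:
    """Hypothetical weighting for mostly correct prompts.
    Prompts in the 50%-100% band get extra weight, with more
    weight when more samples still fail; others are unweighted."""
    if 0.5 <= accuracy < 1.0:
        return 1.0 + 2.0 * (1.0 - accuracy)
    return 1.0

# Small drift on a mastered prompt incurs no penalty;
# drift past the tolerance is penalized linearly.
print(hinge_kl(0.01))        # within tolerance -> 0.0
print(prompt_weight(0.5))    # half-correct prompt -> 2.0
print(prompt_weight(0.2))    # below the band -> 1.0
```

In an actual training loop, `hinge_kl` would be added to the loss for prompts classified as mastered, while `prompt_weight` would scale the policy-gradient term of each prompt's group of rollouts; the paper defines the precise thresholds and schedules.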