# MCPO: Enhancing Large Model Reasoning Ability via Mastery Consolidation and Optimization

> To address the vanishing training signal problem of the GRPO algorithm on mastered prompts (near 100% accuracy) and mostly correct prompts (50%-100% accuracy), we propose the MCPO framework. By optimizing policy updates through hinge KL regularization and a weighting mechanism, it continuously improves pass@1 performance on mathematical reasoning benchmarks and unexpectedly enhances pass@k diversity.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-18T11:43:08.000Z
- Last activity: 2026-04-21T01:55:53.272Z
- Popularity: 88.8
- Keywords: RLVR, GRPO, Policy Optimization, Reasoning Models, Mathematical Reasoning, Catastrophic Forgetting, Exploration Diversity, LLM Training
- Page URL: https://www.zingnex.cn/en/forum/thread/mcpo
- Canonical: https://www.zingnex.cn/forum/thread/mcpo
- Markdown source: floors_fallback

---


To address GRPO's vanishing training signal on mastered prompts (accuracy near 100%) and its weakening signal on mostly correct prompts (50%-100% accuracy), this paper proposes the MCPO framework. Its core innovations are hinge KL regularization, which constrains policy drift on mastered prompts, and a weighting mechanism for mostly correct prompts. MCPO achieves sustained pass@1 improvement on mathematical reasoning benchmarks and, unexpectedly, also enhances pass@k diversity.

## Background: The Rise of RLVR and GRPO

Reinforcement Learning with Verifiable Rewards (RLVR) uses automatically verifiable signals (such as mathematical correctness) to improve large-model reasoning without manual reward annotation. As an RLVR-family method, GRPO computes the advantage by comparing the relative quality of multiple sampled outputs for the same prompt, avoiding the cost of training the separate critic model required by traditional PPO while remaining efficient.
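In the standard GRPO formulation, each sampled response's advantage is its reward standardized within the group of samples drawn for the same prompt. A minimal sketch, assuming binary 0/1 correctness rewards (`group_advantages` is an illustrative helper, not the paper's code):

```python
# Sketch of GRPO's group-relative advantage: sample G responses to one
# prompt, score each, and standardize the rewards within the group --
# no separate critic model is needed.
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group (some correct, some wrong) yields a nonzero signal:
# correct samples are pushed up, incorrect ones pushed down.
advs = group_advantages([1, 1, 0, 0])
```

Because the advantages are relative within the group, they always sum to zero; the training signal comes entirely from the spread of rewards inside the group.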

## Core Issues of GRPO

### Problem 1: Vanishing Training Signals for Mastered Prompts
When a prompt's accuracy approaches 100%, all sampled outputs are correct and the relative advantage collapses to zero. The prompt then yields no effective training signal, leaving the policy free to drift on it and eventually forget it (catastrophic forgetting).
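A quick numeric illustration of this collapse, assuming binary 0/1 rewards and within-group standardization:

```python
# When every sample in the group earns the same reward, the within-group
# advantage is exactly zero for all of them -- the prompt contributes
# no gradient, however important the knowledge it tests.
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-8):
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

mastered = group_advantages([1, 1, 1, 1])  # 100% accuracy group
# every entry of `mastered` is 0.0 -> vanished training signal
```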

### Problem 2: Weight Decay for Mostly Correct Prompts
For prompts with accuracy between 50% and 100%, GRPO's effective per-prompt weight shrinks as accuracy rises. Optimization pressure therefore weakens precisely during the phase from partial correctness to full mastery, undermining consolidation learning.
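Under the same binary-reward assumption, the standardized advantage of a correct sample works out to sqrt((1-p)/p) for a prompt the policy answers correctly with probability p (using the population standard deviation sqrt(p(1-p)); the paper's exact weighting may differ). The signal fades as p approaches 1:

```python
# Per-sample signal magnitude for a correct answer under group
# standardization of Bernoulli(p) rewards: (1-p)/sqrt(p(1-p)) = sqrt((1-p)/p).
import math

def correct_sample_advantage(p):
    """Standardized advantage of a correct sample at accuracy p."""
    return math.sqrt((1 - p) / p)

# Signal shrinks monotonically as the prompt nears full mastery.
signal = {p: round(correct_sample_advantage(p), 3)
          for p in (0.5, 0.75, 0.9, 0.99)}
```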

## Key Innovations of MCPO

### Innovation 1: Hinge KL Regularization
For mastered prompts, a hinge loss mechanism constrains drastic changes to the policy distribution: a penalty is applied only when drift exceeds a threshold, preventing catastrophic forgetting while retaining room for beneficial exploration.
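A minimal sketch of a hinge-style KL penalty, assuming a scalar per-prompt KL estimate between the current policy and a reference; the function name, threshold `tau`, and coefficient are illustrative, not the paper's exact formulation:

```python
# Hinge-style KL regularization: drift below the threshold tau is free
# (exploration is not penalized); only the excess beyond tau is punished.
def hinge_kl_penalty(kl, tau, coef=1.0):
    """Penalize only the part of the KL divergence that exceeds tau."""
    return coef * max(0.0, kl - tau)

free = hinge_kl_penalty(0.25, tau=0.5)  # small drift -> 0.0 penalty
pay  = hinge_kl_penalty(0.75, tau=0.5)  # large drift -> 0.25 penalty
```

This contrasts with the usual always-on KL penalty: a plain KL term discourages any deviation from the reference, while the hinge leaves a free zone where the policy can still move.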

### Innovation 2: Weighting Mechanism for Mostly Correct Prompts
Re-weighting mostly correct prompts ensures the model receives sufficient training signals when approaching mastery, enabling a smooth transition to full mastery and improving learning efficiency.
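One way such a scheme could look; the band boundaries and boost factor below are our illustrative assumptions, not the paper's actual weighting function:

```python
# Hypothetical reweighting: prompts in the mostly-correct band
# (50%-100% accuracy, exclusive) get a boosted weight so the push
# toward full mastery does not fade.
def prompt_weight(accuracy, boost=2.0):
    """Upweight mostly correct prompts; leave others at weight 1."""
    if 0.5 < accuracy < 1.0:
        return boost
    return 1.0

# learning / mostly-correct / mostly-correct / mastered
weights = [prompt_weight(a) for a in (0.3, 0.6, 0.9, 1.0)]
```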

## Experimental Results: Dual Improvement in Performance and Diversity

MCPO steadily improves pass@1 (single-sample accuracy) on three mathematical benchmarks of increasing difficulty: GSM8K (grade-school math), MATH (competition level), and OlympiadBench (Olympiad level).

An unexpected finding: pass@k (the probability that at least one of k samples is correct) also improves significantly, indicating greater diversity in the solution space. This challenges the conventional intuition that consolidation limits exploration; instead, consolidation appears to catalyze diversity, with a stable base policy providing a solid starting point from which to explore.

## Reasons for MCPO's Effectiveness

### Stable Foundation Promotes Exploration
By preventing the forgetting of mastered knowledge, the model gains a stable, reliable foundation. It can then explore new regions of the solution space more aggressively without risking damage to existing knowledge, which makes exploration more efficient.

### Optimized Resource Allocation
Re-weighting concentrates optimization on prompts close to mastery rather than on already-mastered ones, so computation is not wasted where the signal has vanished and problems near mastery receive sufficient attention. The result is a smoother, more efficient learning curve.

## Implications and Future Directions

### Implications for RLVR Practice
- Monitor the distribution of prompt mastery
- Implement special handling for mastered prompts (e.g., regularization)
- Dynamically adjust prompt weights to optimize learning
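A sketch of how these recommendations might be operationalized in a training loop; the accuracy buckets and thresholds are illustrative assumptions, not MCPO's implementation:

```python
# Bucket prompts by rolling accuracy so each band gets the treatment
# the recommendations above suggest: regularize mastered prompts,
# upweight mostly-correct ones, train the rest normally.
def bucket(accuracy):
    if accuracy >= 0.999:
        return "mastered"        # apply hinge KL regularization
    if accuracy > 0.5:
        return "mostly_correct"  # boost prompt weight
    return "learning"            # standard GRPO update

buckets = [bucket(a) for a in (1.0, 0.8, 0.2)]
```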

### Limitations and Future
Current limitations: Experiments are focused on the math domain; the hinge KL threshold requires task-specific tuning; the effect on ultra-large-scale models remains untested.
Future directions: Cross-domain verification (code generation, scientific reasoning); adaptive thresholds; combined strategies; theoretical analysis of the mathematical relationship between mastery and diversity.
