# MaxPO: A New Policy Gradient Method for Post-Training of Reasoning Models

> This article introduces the MaxPO method, which addresses the advantage estimation problem in max@K policy gradients using the Leave-Two-Out baseline, providing a more stable optimization signal for post-training of LLM reasoning models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-04T12:16:39.000Z
- Last activity: 2026-06-05T11:17:43.733Z
- Popularity: 119.0
- Keywords: reinforcement learning, policy gradient, reasoning models, post-training, max@K, GRPO, advantage estimation, LLM optimization
- Page link: https://www.zingnex.cn/en/forum/thread/maxpo
- Canonical: https://www.zingnex.cn/forum/thread/maxpo

---

## Introduction: MaxPO, a New Policy Gradient Method for Post-Training of Reasoning Models

MaxPO addresses the advantage estimation problem in max@K policy gradients with a Leave-Two-Out (L2O) baseline, giving a more stable optimization signal for post-training Large Language Model (LLM) reasoning models. The method targets the training difficulties caused by sparse rewards in reasoning tasks and aims to improve both the stability and the efficiency of training.

Original paper source: arXiv (published on June 4, 2026, link: http://arxiv.org/abs/2606.06080v1)

## Background: Challenges in Post-Training of Reasoning Models and Dilemmas of Existing Methods

### Challenges in Post-Training of Reasoning Models
The reasoning ability of large language models depends heavily on post-training with reinforcement learning. Reasoning tasks, however, have sparse rewards: a reward is given only when the final answer is correct. This makes exploration difficult, because failed attempts carry almost no signal the model can learn from.

### Dilemmas of Existing Methods
To mitigate sparse rewards, researchers have proposed optimizing the max@K objective, the expected reward of the best result among K attempts. Existing estimators for this objective have two problems: how they relate to one another is unclear, and their advantage estimates are not centered. Non-centered advantages can skew individual gradient updates and destabilize training.
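
To make the objective concrete: for K independent attempts $y_1, \dots, y_K$ with rewards $r(y_i)$, max@K is the expectation of $\max_i r(y_i)$. The sketch below is not taken from the paper. It shows a standard way to estimate that quantity without bias from a group of n ≥ K sampled rewards, by averaging the maximum over every size-K subset. The function names and the example rewards are illustrative assumptions.

```python
from itertools import combinations
from math import comb


def max_at_k_estimate(rewards, k):
    """Unbiased estimate of max@K from n >= k sampled rewards.

    Equals the average of max(subset) over all size-k subsets, computed in
    closed form: after sorting ascending, the element at 0-based position i
    is the maximum of exactly C(i, k - 1) subsets.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(rewards)")
    ordered = sorted(rewards)
    total = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return total / comb(n, k)


def max_at_k_brute_force(rewards, k):
    """Reference implementation: enumerate every size-k subset."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


# Hypothetical group of 8 attempts with sparse binary rewards (2 correct).
group = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
for k in (1, 2, 4):
    print(k, max_at_k_estimate(group, k), max_at_k_brute_force(group, k))
```

With binary rewards this reduces to the usual pass@K estimate. For the group above, max@1 is 0.25 (the plain success rate) while max@2 is 13/28 ≈ 0.46, which illustrates why the max@K objective gives a denser signal when correct answers are rare.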

## MaxPO Method: Leave-Two-Out Baseline and Theoretical Contributions

### Core Innovation: Leave-Two-Out (L2O) Baseline
When evaluating a sample's contribution to max@K, the L2O baseline excludes both that sample and the most competitive other sample in the current batch. This keeps the advantage estimates centered (their expected value within the batch is zero), which reduces gradient variance.
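
The centering property can be illustrated with a simpler, well-known relative of L2O. The sketch below implements the ordinary leave-one-out baseline for the plain mean-reward objective, where each sample is compared against the average reward of the other samples in its group. It is shown only to make "centered" concrete; it is not the paper's L2O construction, which leaves out a second sample and targets max@K. The function name and example rewards are assumptions.

```python
import numpy as np


def leave_one_out_advantages(rewards):
    """A_i = r_i minus the mean reward of the other samples in the group.

    The baseline for sample i never uses r_i, so subtracting it does not
    bias the policy gradient, and the advantages sum to zero in the group.
    """
    r = np.asarray(rewards, dtype=np.float64)
    n = r.size
    if n < 2:
        raise ValueError("need at least two samples per group")
    baseline = (r.sum() - r) / (n - 1)  # mean over j != i, for all i at once
    return r - baseline


# Hypothetical group of four attempts with one correct answer.
rewards = np.array([0.0, 0.0, 1.0, 0.0])
centered = leave_one_out_advantages(rewards)
print(centered)        # the correct sample is positive, the others negative
print(centered.sum())  # zero up to floating-point error
```

Using the raw rewards as weights would leave the three failed attempts with no gradient signal at all. After centering, they receive a negative advantage, so the update also pushes probability mass away from the failures.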

### Algorithm Implementation
The estimator runs in quadratic time and parallelizes efficiently on GPUs. It is compatible with group-based reinforcement learning methods such as GRPO and does not require changes to existing training pipelines.
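
As a rough picture of what this compatibility means, the sketch below shows a generic PPO/GRPO-style clipped surrogate loss in which the per-sample advantage is just an input vector. Under that assumption, swapping in a different advantage estimator changes only how `advantages` is computed. This is a simplified sequence-level version written with NumPy for illustration; real GRPO implementations typically work per token and add a KL term, and none of the names here come from the paper.

```python
import numpy as np


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Generic clipped policy-gradient surrogate, returned as a loss to minimize.

    logp_new, logp_old: log-probabilities of each sampled response under the
    current policy and the sampling policy. advantages: one value per
    response, from whichever estimator the pipeline uses.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


# Hypothetical values for a group of four responses.
logp_old = np.array([-12.0, -15.0, -9.0, -11.0])
logp_new = np.array([-11.8, -15.1, -8.7, -11.0])
advantages = np.array([-1.0 / 3.0, -1.0 / 3.0, 1.0, -1.0 / 3.0])
print(clipped_surrogate_loss(logp_new, logp_old, advantages))
```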

### Theoretical Contributions
The paper derives a canonical advantage estimator for the max@K objective and uses it to place existing methods in a single framework. Under this view, existing methods are approximations of the canonical estimator that differ in their choice of baseline and normalization strategy, while the L2O baseline balances variance against bias.

## Experimental Validation: Effectiveness of MaxPO

### Reduction in Gradient Variance
The L2O baseline lowers the variance of the gradient estimate. In high-dimensional policy spaces this reduces the risk of training oscillation and divergence, without resorting to smaller learning rates or longer training to converge.

### Performance Improvement
Compared with non-centered schemes, MaxPO performs better on multiple reasoning tasks. The improvement is attributed to a more precise gradient signal rather than to more complex architectures or additional resources.

## Practical Significance and Future Outlook

### Practical Value
1. **Training Stability**: Centered advantage estimation reduces the risk of training oscillation and divergence.
2. **Sample Efficiency**: More precise gradients extract more information from the same samples, which lowers computational cost.
3. **Generality**: The method applies to max@K settings such as mathematical reasoning, code generation, and theorem proving.
4. **Compatibility**: It integrates with mainstream RL algorithms such as GRPO and PPO and can be dropped into existing pipelines.

### Outlook
MaxPO could be extended to further task scenarios and serve as a basic tool for optimizing LLM reasoning.

## Conclusion: Long-Term Value of MaxPO

Through rigorous mathematical derivation and careful algorithm design, MaxPO provides a reliable building block for post-training reasoning models. In the competition over LLM reasoning capabilities, improving basic optimization methods may have more long-term value than chasing model scale; breakthroughs often come from re-examining existing methods closely rather than from piling on complexity.
