MaxPO: A New Policy Gradient Method for Post-Training of Reasoning Models

This article introduces the MaxPO method, which addresses the advantage estimation problem in max@K policy gradients using the Leave-Two-Out baseline, providing a more stable optimization signal for post-training of LLM reasoning models.

Tags: reinforcement learning, policy gradient, reasoning models, post-training, max@K, GRPO, advantage estimation, LLM optimization
Published 2026-06-04 20:16 | Recent activity 2026-06-05 19:17 | Estimated read 6 min
Section 01

Introduction: MaxPO—A New Policy Gradient Method for Post-Training of Reasoning Models

This article introduces MaxPO, a method that addresses the advantage estimation problem in max@K policy gradients with a Leave-Two-Out (L2O) baseline, giving a more stable optimization signal for post-training Large Language Model (LLM) reasoning models. The method aims to ease the training difficulties caused by sparse rewards in reasoning tasks and to improve the stability and efficiency of training.

Original paper source: arXiv (published on June 4, 2026, link: http://arxiv.org/abs/2606.06080v1)

Section 02

Background: Challenges in Post-Training of Reasoning Models and Dilemmas of Existing Methods

Challenges in Post-Training of Reasoning Models

The reasoning ability of large language models relies on post-training with reinforcement learning. Reasoning tasks, however, have sparse rewards: a reward is given only when the final answer is correct. This makes exploration difficult, and the model gets little signal from failed attempts that it could use to improve.

Dilemmas of Existing Methods

To alleviate sparse rewards, researchers have proposed optimizing the max@K objective, the expected reward of the best result among K attempts. Existing estimators for this objective have two problems: it is unclear how they relate to one another, and their advantage estimates are not centered. Both can push gradient updates in the wrong direction and make training unstable.
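To make the max@K objective concrete, here is a minimal sketch of the standard unbiased estimate of "expected best of K" from n >= K sampled rewards. It averages the maximum over all size-K subsets using a counting identity. The function name max_at_k is chosen for illustration and is not taken from the paper.

```python
from math import comb

def max_at_k(rewards, k):
    """Unbiased estimate of E[max of k i.i.d. samples] from n >= k rewards."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The reward at 0-based sorted position i is the maximum of exactly
    # C(i, k-1) of the C(n, k) size-k subsets (choose k-1 from the i below it).
    weighted = sum(r * comb(i, k - 1) for i, r in enumerate(ordered))
    return weighted / comb(n, k)

# Sparse 0/1 rewards: one correct answer out of four attempts.
print(max_at_k([0, 0, 1, 0], 1))  # 0.25, the plain success rate
print(max_at_k([0, 0, 1, 0], 2))  # 0.5, chance that a pair contains the hit
```

With binary rewards this reduces to the familiar pass@K estimate, which is why max@K gives a denser learning signal than the single-attempt success rate when correct answers are rare.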

Section 03

MaxPO Method: Leave-Two-Out Baseline and Theoretical Contributions

Core Innovation: Leave-Two-Out (L2O) Baseline

When evaluating a sample's contribution to max@K, the L2O baseline excludes both that sample and the most competitive other sample in the current batch. This keeps the advantage estimate centered (its expected value within the batch is zero), which reduces gradient variance.
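This article does not give the L2O formula, so the sketch below does not attempt to reproduce it. As a stand-in, it uses the simpler and well-known leave-one-out mean baseline to show what "centered" means in practice: the advantages within a group sum to zero, whereas raw rewards used directly as advantages do not. The function names are illustrative assumptions.

```python
def leave_one_out_advantages(rewards):
    """Advantage of each sample: its reward minus the mean of the others.

    Stand-in for illustration only; this is NOT the paper's L2O baseline.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples in the group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def group_mean(values):
    return sum(values) / len(values)

rewards = [0.0, 0.0, 1.0, 0.0]                   # one success in four attempts
centered = leave_one_out_advantages(rewards)

print(group_mean(rewards))                        # non-zero: raw rewards are not centered
print(abs(group_mean(centered)) < 1e-12)          # True: leave-one-out is centered
```

Because each baseline is computed without the sample it is applied to, it does not depend on that sample's own outcome, which is the property leave-k-out baselines rely on.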

Algorithm Implementation

The algorithm runs in quadratic time, parallelizes efficiently on GPUs, and is compatible with group-based reinforcement learning methods such as GRPO, so existing training pipelines do not need to be modified.
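The compatibility claim can be pictured as follows: in a group-based pipeline, the advantage computation is a single step that maps a group's rewards to per-sample weights, so a different estimator can be swapped in without touching the rest of the loop. The sketch below uses the group-normalized advantage commonly associated with GRPO as the default. The surrogate_loss hook, its parameters, and the use of the population standard deviation are assumptions for illustration, not the paper's code.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]

def surrogate_loss(logprobs, rewards, advantage_fn=grpo_advantages):
    """Hypothetical hook: policy-gradient surrogate for one group of samples.

    advantage_fn is the only piece a new estimator would need to replace.
    """
    advantages = advantage_fn(rewards)
    weighted = sum(a * lp for a, lp in zip(advantages, logprobs))
    return -weighted / len(rewards)

logprobs = [-1.2, -0.7, -2.1, -0.9]   # made-up sequence log-probabilities
rewards = [0.0, 0.0, 1.0, 0.0]
print(surrogate_loss(logprobs, rewards))
```

Any function with the same signature as grpo_advantages, taking a list of rewards and returning a list of advantages of equal length, could be passed as advantage_fn.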

Theoretical Contributions

The paper derives the canonical advantage estimator for the max@K objective and uses it to unify the interpretation of existing methods: they are approximations of the canonical estimator that differ in baseline selection and normalization strategy. Within this framework, the L2O baseline balances variance and bias.

Section 04

Experimental Validation: Effectiveness of MaxPO

Reduction in Gradient Variance

The L2O baseline reduces the variance of the gradient estimate, lowering the risk of training oscillation and divergence in high-dimensional policy spaces without requiring a smaller learning rate or a longer time to converge.
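The general mechanism behind this kind of variance reduction can be checked exactly on a toy problem. The sketch below is not the paper's experiment: it assumes a one-parameter Bernoulli policy with success probability p, a sparse reward of 1 on success and 0 otherwise, and the score-function estimator g = (r - b) * (a - p). It compares a zero baseline with the baseline b = p (the expected reward).

```python
def estimator_moments(p, baseline):
    """Exact mean and variance of g = (r - b) * (a - p), a ~ Bernoulli(p), r = a."""
    outcomes = [(1, p), (0, 1.0 - p)]  # (action, probability)
    values = [((a - baseline) * (a - p), prob) for a, prob in outcomes]
    mean = sum(g * prob for g, prob in values)
    var = sum((g - mean) ** 2 * prob for g, prob in values)
    return mean, var

p = 0.2  # successes are rare, as with sparse rewards
mean_raw, var_raw = estimator_moments(p, baseline=0.0)
mean_ctr, var_ctr = estimator_moments(p, baseline=p)

print(abs(mean_raw - mean_ctr) < 1e-12)  # True: the baseline leaves the mean unchanged
print(var_ctr < var_raw)                 # True: variance is lower in this example
```

The expected gradient is identical in both cases, so the baseline changes only the noise. A baseline is not guaranteed to lower variance for every policy and reward (in this toy setup it does not when p is close to 1), which is why the choice of baseline matters.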

Performance Improvement

Compared with non-centered schemes, MaxPO performs better on multiple reasoning tasks. The improvement comes from more precise gradient signals rather than from more complex architectures or additional resources.

Section 05

Practical Significance and Future Outlook

Practical Value

  1. Training Stability: Centered advantage estimation reduces the risk of training oscillation and divergence;
  2. Sample Efficiency: More precise gradients extract more information from the same samples, reducing computational cost;
  3. Generality: Applicable to max@K scenarios such as mathematical reasoning, code generation, and theorem proving;
  4. Compatibility: Integrates with mainstream RL methods such as GRPO and PPO as a plug-and-play component.

Outlook

MaxPO can be extended to more task scenarios and can serve as a basic tool for optimizing LLM reasoning.

Section 06

Conclusion: Long-Term Value of MaxPO

Through rigorous mathematical derivation and careful algorithm design, MaxPO provides a reliable building block for post-training reasoning models. In the competition over LLM reasoning capabilities, improving basic optimization methods has more long-term value than chasing model scale; breakthroughs often come from carefully re-examining existing methods rather than from piling on complexity.