Section 01
Introduction: MaxPO, a New Policy Gradient Method for Post-Training Reasoning Models
This article introduces MaxPO, a method that tackles the advantage estimation problem in max@K policy gradients by using a Leave-Two-Out (L2O) baseline, which provides a more stable optimization signal for post-training large language model (LLM) reasoning models. The method aims to ease the training difficulties caused by sparse rewards in reasoning tasks and thereby improve training stability and efficiency.
Original paper source: arXiv (published on June 4, 2026, link: http://arxiv.org/abs/2606.06080v1)