MaxPO: A New Policy Gradient Method for Post-Training of Reasoning Models

This article introduces the MaxPO method, which addresses the advantage estimation problem in max@K policy gradients using the Leave-Two-Out baseline, providing a more stable optimization signal for post-training of LLM reasoning models.

Tags: reinforcement learning, policy gradient, reasoning models, post-training, max@K, GRPO, advantage estimation, LLM optimization

Published 2026-06-04 20:16 | Recent activity 2026-06-05 19:17 | Estimated read 6 min

Section 01

Introduction: MaxPO—A New Policy Gradient Method for Post-Training of Reasoning Models

This article introduces MaxPO, a method that addresses the advantage estimation problem in max@K policy gradients with a Leave-Two-Out (L2O) baseline, giving a more stable optimization signal for post-training reasoning-oriented large language models (LLMs). The method targets the training difficulties caused by sparse rewards in reasoning tasks and aims to improve both the stability and the efficiency of training.

Original paper source: arXiv (published on June 4, 2026, link: http://arxiv.org/abs/2606.06080v1)

Section 02

Background: Challenges in Post-Training of Reasoning Models and Dilemmas of Existing Methods

Challenges in Post-Training of Reasoning Models

Large language models acquire much of their reasoning ability through post-training with reinforcement learning. Reasoning tasks, however, have sparse rewards: a reward is given only when the final answer is correct. This makes exploration difficult and leaves the model with little signal to learn from failed attempts.

Dilemmas of Existing Methods

To alleviate sparse rewards, researchers have proposed optimizing the max@K objective, the expected reward of the best result among K attempts. Existing estimators for this objective have two problems: the relationships among them are unclear, and their advantage estimates are not centered. Both can push gradient updates off course and make training unstable.
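As a concrete reference for the max@K objective, the sketch below estimates it from N sampled rewards for one prompt, with K at most N. It averages the maximum over every size-K subset of the samples, which is the standard unbiased combinatorial estimator and reduces to the familiar pass@k formula when rewards are 0 or 1. The function name `max_at_k` and its signature are assumptions made for illustration; the article does not say which estimator of the objective the paper itself uses.

```python
from math import comb


def max_at_k(rewards, k):
    """Unbiased estimate of E[max of k draws] from n >= k sampled rewards.

    Hypothetical helper for illustration; not taken from the MaxPO paper.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(rewards)")
    ordered = sorted(rewards)  # ascending order
    # The element at 0-based position i has i elements at or below it, so it
    # is the maximum of exactly C(i, k - 1) of the C(n, k) size-k subsets.
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)
```

With `k=1` this is the plain mean reward, and with `k` equal to the number of samples it is simply the best reward observed.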

Section 03

MaxPO Method: Leave-Two-Out Baseline and Theoretical Contributions

Core Innovation: Leave-Two-Out (L2O) Baseline

When evaluating a sample's contribution to max@K, MaxPO excludes both that sample and the most competitive sample in the current batch. This keeps the advantage estimate centered (its expected value within the batch is zero) and reduces gradient variance.
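To make "centered" concrete, the sketch below computes the simpler leave-one-out advantage for a max-over-group reward, where each sample's baseline is the best reward among the other samples. This is explicitly not the L2O baseline, whose exact formula the article does not give; it is only a reference point, assuming the group size equals K. The function name is hypothetical. Because each baseline ignores its own sample, the gradient estimate stays unbiased, but every advantage is non-negative, so the group mean is above zero whenever there is a unique best sample. That is the kind of non-centered signal described above.

```python
def leave_one_out_max_advantages(rewards):
    """A_j = max(all rewards) - max(all rewards except sample j).

    Illustrative leave-one-out baseline; NOT the paper's Leave-Two-Out rule.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two samples")
    top_index = max(range(len(rewards)), key=rewards.__getitem__)
    top = rewards[top_index]
    runner_up = max(r for i, r in enumerate(rewards) if i != top_index)
    # Dropping a non-top sample leaves the maximum unchanged (advantage 0).
    # Dropping the top sample lowers the maximum to the runner-up.
    return [top - runner_up if i == top_index else 0.0
            for i in range(len(rewards))]


advantages = leave_one_out_max_advantages([1.0, 4.0, 2.0, 3.0])
mean_advantage = sum(advantages) / len(advantages)  # positive, so not centered
```

Only the best sample receives a nonzero advantage here, and nothing pushes the other samples down, so the mean advantage in the group is positive rather than zero.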

Algorithm Implementation

The algorithm runs in quadratic time, parallelizes efficiently on GPUs, and is compatible with group-based reinforcement learning frameworks such as GRPO, so existing training pipelines need no modification.
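The sketch below shows where a per-group advantage estimator typically plugs into a group-based pipeline. The default is a GRPO-style group-normalized advantage (reward minus group mean, divided by group standard deviation); a different estimator would be passed in as `advantage_fn`. All names here are hypothetical, the use of the population standard deviation and the `eps` value are assumptions, and the article does not show how MaxPO itself is wired in.

```python
from statistics import fmean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of sampled responses."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use sample std
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


def advantages_for_batch(groups, advantage_fn=group_normalized_advantages):
    """Apply one advantage estimator independently to each prompt's group."""
    return [advantage_fn(group) for group in groups]


# Two prompts, four sampled responses each, with binary correctness rewards.
batch = [[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
batch_advantages = advantages_for_batch(batch)
```

Because the estimator is just a function from a group's rewards to a group's advantages, swapping it does not touch the sampling or policy-update code, which is the sense in which group-based frameworks make such replacements straightforward.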

Theoretical Contributions

The paper derives the canonical advantage estimate for the max@K objective, which gives a unified interpretation of existing methods: each is an approximation of the canonical estimate, and they differ in baseline selection and normalization strategy. The L2O baseline balances variance and bias.

Section 04

Experimental Validation: Effectiveness of MaxPO

Reduction in Gradient Variance

The L2O baseline reduces the variance of gradient estimation, lowering the risk of training oscillations and divergence in high-dimensional policy spaces without requiring smaller learning rates or longer convergence times.

Performance Improvement

Compared with non-centered schemes, MaxPO performs better on multiple reasoning tasks. The improvement comes from more precise gradient signals rather than from more complex structures or additional resources.

Section 05

Practical Significance and Future Outlook

Practical Value

Training Stability: Centered advantage estimation reduces the risk of training oscillations and divergence;
Sample Efficiency: Precise gradients extract more information from the same samples, reducing computational costs;
Generality: Applicable to max@K scenarios such as mathematical reasoning, code generation, theorem proving, etc.;
Compatibility: Seamlessly integrates with mainstream RL frameworks like GRPO and PPO, plug-and-play.

Outlook

MaxPO can be extended to further task scenarios, serving as a basic tool for LLM reasoning optimization.

Section 06

Conclusion: Long-Term Value of MaxPO

Through rigorous mathematical derivation and careful algorithm design, MaxPO provides a reliable foundational component for post-training of reasoning models. In the competition over LLM reasoning capabilities, improving basic optimization methods has more long-term value than chasing model scale; breakthroughs often come from careful examination of existing methods rather than from piling on complexity.
