POPO: A New Paradigm for Enhancing Large Model Reasoning Capabilities Using Only Positive Samples

This article introduces Positive-Only Policy Optimization (POPO), a new reinforcement learning method that trains the reasoning capabilities of large language models without negative samples and outperforms GRPO by 6.67 percentage points on AIME 2025.

Tags: Reinforcement Learning · Large Language Models · Reasoning Capability · GRPO · Positive-Sample Optimization · RLVR · Qwen · Mathematical Reasoning
Published 2026-05-08 01:55 · Recent activity 2026-05-08 12:17 · Estimated read 6 min

Section 01

[Introduction] POPO: A New Paradigm for Enhancing Large Model Reasoning Capabilities Without Negative Samples

This article introduces Positive-Only Policy Optimization (POPO), a new reinforcement learning method that trains the reasoning capabilities of large language models without negative samples. POPO targets a weakness of GRPO: under a sparse binary reward, negative samples cannot express how severe a failure is, so every incorrect answer is penalized in the same way. By optimizing with positive samples only, POPO outperforms GRPO by 6.67 percentage points on the AIME 2025 benchmark.


Section 02

Background: Evolution and Limitations from PPO to GRPO

In recent years, Reinforcement Learning with Verifiable Rewards (RLVR) has become the mainstream paradigm for enhancing the reasoning capabilities of large models. Group Relative Policy Optimization (GRPO) simplified the advantage estimation mechanism and made progress on mathematical reasoning tasks, but it has a fundamental limitation: with a sparse binary reward, negative samples cannot express how severe a failure is, so the reward signal is too coarse for the model to learn fine-grained directions of improvement.
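
For reference, here is a minimal sketch of the group-relative advantage that GRPO computes within each group of sampled responses. It illustrates the standard formulation described above and is not code from the POPO paper:

```python
import numpy as np

def grpo_group_advantages(rewards):
    """Group-relative advantage: normalize each sampled response's reward
    by the mean and standard deviation of its group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:
        # All responses in the group scored the same -> no learning signal.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# With a binary verifiable reward, every incorrect response receives the same
# negative advantage, regardless of how close it came to a correct answer.
print(grpo_group_advantages([1, 0, 0, 0]))  # -> [ 1.73, -0.58, -0.58, -0.58]
```

Under a binary verifiable reward, every incorrect response in a group gets the same negative advantage, which is exactly the coarseness POPO sets out to avoid.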


Section 03

Core of POPO: Positive Sample Optimization with Complete Abandonment of Negative Samples

The core idea of POPO is to perform policy optimization entirely with online positive samples, without explicitly using negative samples. It adopts a bounded importance-sampling technique, and the key insight is that implicit negative gradients emerge naturally from the reallocation of probability mass: when the probability of generating positive samples is reinforced, the relative probabilities of all other samples (including negative ones) necessarily decrease. This acts as an implicit gradient penalty while avoiding the noise and instability that explicit negative samples introduce.
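
As an illustration of how such an objective might look, here is a minimal sketch assuming a PyTorch setup; the function name, the ratio bound, and the exact form of the surrogate are our assumptions based on the description above, not the authors' released code:

```python
import torch

def popo_positive_only_loss(logp_new, logp_old, rewards, ratio_bound=2.0):
    """Positive-only surrogate loss (illustrative sketch).

    logp_new, logp_old: per-response log-probabilities under the current and
        behavior policies, shape [num_samples].
    rewards: verifiable rewards, 1.0 for a correct answer, 0.0 otherwise.
    ratio_bound: upper bound on the importance-sampling ratio (assumed value).
    """
    positive = rewards > 0
    if not positive.any():
        # No verified-correct samples in this batch: contribute nothing,
        # but keep the computation graph intact.
        return logp_new.sum() * 0.0

    ratio = torch.exp(logp_new - logp_old)       # importance-sampling ratio
    ratio = torch.clamp(ratio, max=ratio_bound)  # bounded importance sampling
    # Maximize the (bounded) weighted likelihood of positive samples only.
    # Because the policy is a normalized distribution, pushing probability
    # mass toward positive samples implicitly pushes it away from all other
    # responses, including the incorrect ones.
    return -ratio[positive].mean()
```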


Section 04

Stabilization Mechanisms: Twin Networks and Bounded Similarity Penalty

To improve training stability, POPO introduces two innovations:

  1. Twin Policy Networks: two policy networks with a shared parameterization, where the main network updates quickly and the twin network follows it with momentum smoothing to stabilize policy evolution;
  2. Bounded Similarity Penalty: replaces the KL-divergence constraint by measuring the similarity of the policy distributions in the twin network's representation space, which is more efficient and more stable. A minimal sketch of both mechanisms follows this list.
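
The sketch below shows one way these two mechanisms could be realized in PyTorch; the momentum value, the use of cosine similarity as the bounded penalty, and the function names are our assumptions drawn from the description above, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_twin(main_net, twin_net, momentum=0.99):
    """Momentum smoothing: the twin policy slowly tracks the main policy
    (the momentum value is an assumption, not taken from the paper)."""
    for p_main, p_twin in zip(main_net.parameters(), twin_net.parameters()):
        p_twin.mul_(momentum).add_(p_main, alpha=1.0 - momentum)

def bounded_similarity_penalty(main_logits, twin_logits):
    """A bounded stand-in for the KL constraint: penalize dissimilarity between
    the main and twin policy distributions via cosine similarity, which is
    naturally bounded (again, an illustrative reading of the article)."""
    p_main = F.softmax(main_logits, dim=-1)
    p_twin = F.softmax(twin_logits, dim=-1)
    return (1.0 - F.cosine_similarity(p_main, p_twin, dim=-1)).mean()
```

In this reading, the twin network serves as a slowly moving reference policy, and the bounded penalty replaces an unbounded KL term that can spike when the two distributions drift apart.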

Section 05

Experimental Evidence: POPO Outperforms GRPO Across the Board

Experimental results on Qwen series models are significant:

| Model        | Method | AIME 2025 |
|--------------|--------|-----------|
| Qwen-Math-7B | GRPO   | 30.00%    |
| Qwen-Math-7B | POPO   | 36.67%    |
POPO improves over GRPO by 6.67 percentage points on AIME 2025, and ablation experiments show that the twin network and the bounded similarity penalty are both necessary stabilization measures.

Section 06

Technical Significance and Future Outlook

  1. Theoretical aspect: challenges the assumption in the RL field that negative samples must be handled explicitly, and motivates further research on sample efficiency;
  2. Practical aspect: simplifies the RLVR training pipeline, reduces inference computation overhead by 50%, removes the need for negative-sample selection rules, and shrinks the hyperparameter-tuning space;
  3. Future outlook: extend POPO to tasks such as code generation and logical reasoning, and explore combining it with test-time compute scaling to increase reasoning depth.


Section 07

Conclusion: The Value and Impact of POPO

POPO is an important advancement in the post-training field of large language models. It achieves reinforcement learning without negative samples through probability distribution normalization constraints, maintaining stability while outperforming existing methods. It not only provides a plug-and-play training improvement solution but also offers a new perspective for understanding the essential mechanism of RLVR.