# POPO: A New Paradigm for Enhancing Large Model Reasoning Capabilities Using Only Positive Samples

> This article introduces Positive-Only Policy Optimization (POPO), a new reinforcement learning method that trains the reasoning capabilities of large language models without negative samples and outperforms GRPO by 6.67 percentage points on AIME 2025.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-07T17:55:21.000Z
- Last activity: 2026-05-08T04:17:34.225Z
- Popularity: 140.6
- Keywords: reinforcement learning, large language models, reasoning capability, GRPO, positive-sample optimization, RLVR, Qwen, mathematical reasoning
- Page link: https://www.zingnex.cn/en/forum/thread/popo
- Canonical: https://www.zingnex.cn/forum/thread/popo
- Markdown source: floors_fallback

---

## [Introduction] POPO: A New Paradigm for Enhancing Large Model Reasoning Capabilities Without Negative Samples

This article introduces Positive-Only Policy Optimization (POPO), a new reinforcement learning method that trains the reasoning capabilities of large language models without negative samples. POPO addresses a weakness of GRPO, whose negative-sample gradients do not distinguish degrees of failure, and outperforms GRPO by 6.67 percentage points on the AIME 2025 benchmark. Its core idea is to enhance reasoning ability by optimizing on positive samples alone.

## Background: Evolution and Limitations from PPO to GRPO

In recent years, Reinforcement Learning with Verifiable Rewards (RLVR) has become the mainstream paradigm for enhancing the reasoning capabilities of large models. Group Relative Policy Optimization (GRPO) has made progress on mathematical reasoning tasks by simplifying advantage estimation, but it has a fundamental issue: its negative-sample gradients do not distinguish degrees of failure, and under sparse binary rewards the signal is too coarse for the model to learn fine-grained directions of improvement.
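For context, GRPO's simplified advantage estimation normalizes each response's reward against its sampled group. A minimal sketch (the standard group-relative estimator; variable names are illustrative) shows why binary rewards give only a coarse signal:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each response in a sampled
    group is scored against the group mean, normalized by the group's
    standard deviation."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0:  # all responses share one reward -> no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Binary verifiable rewards: two correct, two incorrect responses.
print(grpo_advantages([1, 1, 0, 0]))  # -> [ 1.  1. -1. -1.]
```

Note that every wrong answer gets the same negative advantage regardless of how badly it failed, which is exactly the "gradient of failure severity" problem described above.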

## The Core of POPO: Positive-Sample Optimization That Abandons Negative Samples Entirely

The core idea of POPO is to perform policy optimization entirely on online positive samples, without explicitly using negative samples, relying on a bounded importance-sampling technique. The key insight is that implicit negative gradients emerge naturally from the reallocation of probability mass: because the output distribution must sum to one, reinforcing the probability of positive samples necessarily lowers the relative probability of all other samples, including the negative ones. This acts as an implicit gradient penalty while avoiding the noise and instability that explicit negative samples introduce.
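The article does not give POPO's exact objective, so the following is only a sketch under stated assumptions: a PPO-style clipped (bounded) importance ratio, applied exclusively to responses whose verifiable reward is positive. The function name and the `eps` bound are illustrative, not from the source:

```python
import numpy as np

def popo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Hypothetical positive-only objective: keep only responses with a
    positive verifiable reward, weight them by a bounded importance
    ratio, and maximize their likelihood.  Because softmax probabilities
    sum to one, pushing mass onto positive samples implicitly pushes it
    off negative ones -- no explicit negative gradient is computed."""
    pos = rewards > 0                  # discard negative samples entirely
    if not pos.any():
        return 0.0                     # group has no positive signal
    ratio = np.exp(logp_new[pos] - logp_old[pos])
    clipped = np.clip(ratio, 1 - eps, 1 + eps)  # bounded IS weights
    return -np.minimum(ratio, clipped).mean()   # pessimistic, as in PPO
```

In a real trainer `logp_new` and `logp_old` would be per-token log-probabilities from the current and behavior policies; here they are plain arrays to keep the sketch self-contained.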

## Stabilization Mechanisms: Twin Networks and Bounded Similarity Penalty

To improve training stability, POPO introduces two innovations:
1. **Twin Policy Networks**: two policy networks of identical architecture, where the main network updates quickly and the twin follows it with momentum smoothing (an exponential moving average) to stabilize policy evolution;
2. **Bounded Similarity Penalty**: replaces the KL-divergence constraint by measuring the similarity of policy distributions in the twin network's representation space, which is more efficient and more stable.
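The source describes these mechanisms only at a high level, so the sketch below fills in plausible details: an EMA update for the twin, and total-variation distance (capped at a bound) as a stand-in for the unspecified similarity measure. The momentum value, the TV-distance choice, and the `bound` parameter are all assumptions:

```python
import numpy as np

def ema_update(twin_params, main_params, momentum=0.99):
    """Twin network tracks the main network via momentum smoothing
    (exponential moving average).  momentum=0.99 is illustrative."""
    return {k: momentum * twin_params[k] + (1 - momentum) * main_params[k]
            for k in main_params}

def bounded_similarity_penalty(p_main, p_twin, bound=0.1):
    """Illustrative bounded penalty replacing a KL constraint: the
    per-sample total-variation distance between the two policies' token
    distributions, capped so a single divergent sample cannot dominate."""
    tv = 0.5 * np.abs(p_main - p_twin).sum(axis=-1)  # total variation
    return np.minimum(tv, bound).mean()              # bounded per sample
```

Capping the penalty keeps gradients finite even when the policies disagree sharply, which is one plausible reading of why a bounded penalty would be "more efficient and stable" than an unbounded KL term.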

## Experimental Evidence: POPO Outperforms GRPO Across the Board

Experimental results on Qwen series models are significant:
| Model | Method | AIME 2025 |
|------|------|-----------|
| Qwen-Math-7B | GRPO | 30.00% |
| Qwen-Math-7B | POPO | **36.67%** |

POPO improves on GRPO by 6.67 percentage points on AIME 2025, and ablation experiments show that both the twin networks and the bounded similarity penalty are necessary stabilization measures.

## Technical Significance and Future Outlook

- **Theoretical**: challenges the assumption in RL that negative samples must be handled explicitly, opening new research directions on sample efficiency;
- **Practical**: simplifies the RLVR training pipeline, reduces inference computation overhead by 50%, avoids negative-sample selection rules, and shrinks the hyperparameter tuning space;
- **Future work**: extend to tasks such as code generation and logical reasoning, and explore combining with test-time compute scaling to deepen reasoning.

## Conclusion: The Value and Impact of POPO

POPO is an important advancement in the post-training field of large language models. It achieves reinforcement learning without negative samples through probability distribution normalization constraints, maintaining stability while outperforming existing methods. It not only provides a plug-and-play training improvement solution but also offers a new perspective for understanding the essential mechanism of RLVR.
