# POPO: A New Paradigm of Reinforcement Learning Without Negative Samples

> POPO performs policy optimization using only positive sample rollouts, achieves efficient learning via implicit negative gradients, and outperforms GRPO by 6.67 percentage points on AIME 2025.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-07T17:55:21.000Z
- Last activity: 2026-05-08T07:21:43.527Z
- Heat: 133.6
- Keywords: Reinforcement Learning, RLVR, Policy Optimization, Positive-Sample Learning, Large Language Models, Mathematical Reasoning
- Page URL: https://www.zingnex.cn/en/forum/thread/popo-fd062ff1
- Canonical: https://www.zingnex.cn/forum/thread/popo-fd062ff1
- Markdown source: floors_fallback

---

## Introduction: Core Interpretation of POPO, a New Paradigm of Reinforcement Learning Without Negative Samples

POPO is a new reinforcement-learning paradigm that uses no negative samples. It performs policy optimization using only positive sample rollouts and learns efficiently via implicit negative gradients. On AIME 2025, the framework scored 36.67% with the Qwen-Math-7B model, 6.67 percentage points higher than GRPO, challenging the traditional view that RLVR must rely on contrasting positive and negative samples.

## Background: Evolution of RLVR and Inherent Defects of Negative Samples

Reinforcement Learning with Verifiable Rewards (RLVR) has become the mainstream paradigm for improving the reasoning ability of large language models. In the evolution from PPO to GRPO, algorithmic simplification has brought efficiency gains: GRPO replaces PPO's complex advantage estimation with a simple group-relative estimate computed from the positive and negative samples within each rollout group. However, negative samples have inherent drawbacks: every failed rollout is penalized the same regardless of how close it came to success, and the combinatorial explosion of failure modes means a small number of sampled negatives rarely carries a meaningful reward signal.
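For concreteness, here is a compact illustration (not from the original post) of how GRPO's group-relative advantage is commonly computed; the function name is purely illustrative. It shows the explicit negative-sample signal that POPO later removes:

```python
import numpy as np

def grpo_group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: each rollout's reward is normalized against the
    mean and std of its own group, so below-average ("negative") rollouts receive
    negative advantages and their probability is pushed down explicitly."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of 4 verifiable rewards (1 = correct, 0 = incorrect)
print(grpo_group_advantages([1.0, 0.0, 0.0, 1.0]))
# ~[ 1. -1. -1.  1. ]: correct rollouts are reinforced, incorrect ones penalized.
```

Note how all incorrect rollouts get the same advantage of roughly -1, which is exactly the "no gradient distinction in the degree of failure" problem described above.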

## Core Solution of POPO: Policy Optimization Using Only Positive Samples

The POPO (Positive-Only Policy Optimization) framework proposed by the research team learns entirely from online positive sample rollouts. Its key insight is that reinforcing the probability of positive samples gives rise to implicit negative gradients: because the output distribution is normalized, raising the probability of positive rollouts necessarily lowers the probability of everything else, so the optimization effect of explicit negative samples emerges for free. The framework processes the positive-sample set with bounded importance sampling and does not rely on any negative samples for gradient guidance.
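The post does not give the exact objective, so the following is only a minimal sketch of what a positive-only update with bounded importance weights might look like; the function name, clipping bound, and loss form are assumptions, not the authors' implementation:

```python
import torch

def popo_positive_only_loss(logp_new, logp_old, is_positive, clip=5.0):
    """Hypothetical positive-only surrogate: only positive rollouts contribute
    gradients; their importance weights are bounded so stale off-policy rollouts
    cannot blow up the update."""
    ratio = torch.exp(logp_new - logp_old).clamp(max=clip)  # bounded importance weight
    mask = is_positive.float()                               # drop negative rollouts entirely
    # Maximize the importance-weighted log-likelihood of positive samples only;
    # softmax normalization then implicitly lowers the probability of all other outputs.
    return -(ratio.detach() * logp_new * mask).sum() / mask.sum().clamp(min=1.0)
```

The point of the sketch is the mask: negative rollouts never enter the loss, so any downward pressure on them comes solely from normalization, i.e. the implicit negative gradient.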

## Training Stability Mechanisms of POPO

POPO stabilizes policy optimization through two mechanisms (a hypothetical sketch of both follows the list):
1. **Twin Policy Networks with Momentum Adaptation**: a twin policy network tracks the trained policy via momentum-based adaptive rules, yielding stable policy evolution and avoiding training oscillations.
2. **Bounded Similarity Penalty**: replaces the traditional KL-divergence constraint with a bounded similarity penalty in representation space, keeping the policy near its reference point while leaving a more flexible optimization space.
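Since the post does not specify either mechanism in detail, the sketch below only shows one plausible reading under stated assumptions: the twin network is treated as an exponential moving average of the policy, and the similarity penalty is a cosine-distance hinge in hidden-state space. The momentum value, tolerance, and function names are assumptions, not the authors' design:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(policy, twin, m=0.99):
    """Hypothetical momentum rule: the twin network trails the trained policy as
    an exponential moving average, providing a slowly moving reference point."""
    for p, t in zip(policy.parameters(), twin.parameters()):
        t.mul_(m).add_(p, alpha=1.0 - m)

def bounded_similarity_penalty(h_policy, h_twin, bound=0.1):
    """Hypothetical representation-space penalty: cost is incurred only once the
    policy's hidden states drift beyond a tolerance from the twin's, a looser
    constraint than a KL term on the full output distribution."""
    dist = 1.0 - F.cosine_similarity(h_policy, h_twin.detach(), dim=-1)  # 0 = identical
    return torch.clamp(dist - bound, min=0.0).mean()
```

Compared with a per-token KL penalty, a bounded penalty of this kind is zero inside the tolerance region, which is one way to give the policy "a more flexible optimization space" while still preventing large drifts.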

## Experimental Evidence: Performance of POPO

The research team ran experiments on multiple mathematical benchmarks using publicly available mainstream language models such as the Qwen series:
- POPO's performance is comparable to or even better than GRPO;
- Qwen-Math-7B achieved 36.67% at AIME 2025, exceeding GRPO's 30.00%;
- Ablation studies and parameter sweeps confirmed that each component is necessary and robust.

## Conclusion: Significance and Breakthroughs of POPO

The success of POPO challenges the traditional view that RLVR must rely on contrasting positive and negative samples. It simplifies the algorithm (no negative samples need to be generated or managed) and may avoid the noise and bias that negative samples introduce, which is of practical value for large-scale RL training that consumes large numbers of rollouts.

## Future Research Suggestions

Future work could explore POPO's applicability to other task types (such as code generation and scientific reasoning) and its combination with other optimization techniques.
