Section 01
Introduction: Core Interpretation of POPO, a New Paradigm of Reinforcement Learning Without Negative Samples
POPO is a reinforcement-learning paradigm that dispenses with negative samples: it optimizes the policy using only positive-sample rollouts, and the suppressing effect of a negative gradient arises implicitly. On AIME 2025, the framework reached a score of 36.67% with the Qwen-Math-7B model, 6.67 percentage points above GRPO, challenging the conventional view that RLVR must rely on contrasting positive and negative samples.
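POPO's actual objective is not spelled out in this section, so the following is only a toy sketch of the general idea (all names, numbers, and the verifier are illustrative assumptions, not taken from the paper): a categorical policy is updated exclusively on rollouts that a verifier scores as correct, yet because softmax normalizes the distribution, raising the logit of a positive sample implicitly pushes down the probability of every other sample. That renormalization is one concrete sense in which a positive-only update carries an "implicit negative gradient."

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy setup (illustrative only): a categorical "policy" over 4
# candidate answers; answer 0 is the verifiably correct one.
rng = np.random.default_rng(0)
logits = np.zeros(4)

def verifier(action):
    # hypothetical binary verifiable reward, as in RLVR settings
    return 1.0 if action == 0 else 0.0

lr = 0.5
for step in range(200):
    probs = softmax(logits)
    action = rng.choice(4, p=probs)
    if verifier(action) > 0:
        # positive-only update: zero-reward rollouts are simply discarded
        # grad of log pi(action) w.r.t. logits = one_hot(action) - probs
        grad = -probs
        grad[action] += 1.0
        logits += lr * grad
        # Because softmax renormalizes, increasing the positive action's
        # logit implicitly lowers every other action's probability --
        # no explicit negative-sample term is ever computed.

probs = softmax(logits)
```

After training, the probability mass concentrates on the verified answer even though incorrect rollouts were never penalized directly, which is the intuition behind learning without explicit negative samples.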