Section 01
[Introduction] POPO: A New Paradigm for Enhancing Large Model Reasoning Capabilities Without Negative Samples
This article introduces Positive-Only Policy Optimization (POPO), a reinforcement learning method for training the reasoning capabilities of large language models without negative samples. It addresses a limitation of GRPO, in which the penalty assigned to negative samples does not scale with how severely a rollout failed, and it outperforms GRPO by 6.67 percentage points on the AIME 2025 benchmark. The core idea is to optimize the policy using positive samples alone.
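The contrast between GRPO and a positive-only scheme can be illustrated with a small sketch. This is a hypothetical illustration of the idea as described above, not the paper's actual implementation: GRPO normalizes rewards within a sampled group, so failed rollouts receive negative advantages, while a positive-only variant keeps just the samples whose advantage is positive and lets the rest contribute nothing to the gradient.

```python
def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: (r - mean) / std.

    Samples below the group mean get negative advantages, which
    penalize them regardless of how badly they actually failed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]


def positive_only_advantages(rewards):
    """Hypothetical positive-only variant: clamp negative advantages to 0,
    so only above-average (positive) samples drive the policy update."""
    return [max(a, 0.0) for a in grpo_advantages(rewards)]
```

For example, with binary rewards `[1.0, 0.0, 0.0, 1.0]`, GRPO assigns advantages `[1.0, -1.0, -1.0, 1.0]`, while the positive-only variant keeps `[1.0, 0.0, 0.0, 1.0]`, discarding the failed samples instead of penalizing them uniformly.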