Section 01
Introduction: Core Interpretation of POPO, a New Paradigm of Reinforcement Learning Without Negative Samples
POPO is a reinforcement-learning paradigm that dispenses with negative samples: it optimizes the policy using only positive-sample rollouts, and the suppressing effect of a negative gradient arises implicitly. On AIME 2025, the framework reached a score of 36.67% with the Qwen-Math-7B model, 6.67 percentage points above GRPO, challenging the conventional view that RLVR must rely on contrasting positive and negative samples.
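POPO's actual objective is not spelled out in this section, so the following is only a toy sketch of the general idea (all names, numbers, and the verifier are illustrative assumptions, not taken from the paper): a categorical policy is updated exclusively on rollouts that a verifier scores as correct, yet because softmax normalizes the distribution, raising the logit of a positive sample implicitly pushes down the probability of every other sample. That renormalization is one concrete sense in which a positive-only update carries an "implicit negative gradient."

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy setup (illustrative only): a categorical "policy" over 4
# candidate answers; answer 0 is the verifiably correct one.
rng = np.random.default_rng(0)
logits = np.zeros(4)

def verifier(action):
    # hypothetical binary verifiable reward, as in RLVR settings
    return 1.0 if action == 0 else 0.0

lr = 0.5
for step in range(200):
    probs = softmax(logits)
    action = rng.choice(4, p=probs)
    if verifier(action) > 0:
        # positive-only update: zero-reward rollouts are simply discarded
        # grad of log pi(action) w.r.t. logits = one_hot(action) - probs
        grad = -probs
        grad[action] += 1.0
        logits += lr * grad
        # Because softmax renormalizes, increasing the positive action's
        # logit implicitly lowers every other action's probability --
        # no explicit negative-sample term is ever computed.

probs = softmax(logits)
```

After training, the probability mass concentrates on the verified answer even though incorrect rollouts were never penalized directly, which is the intuition behind learning without explicit negative samples.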