POPO: A New Paradigm for Enhancing Large Model Reasoning Capabilities Using Only Positive Samples

This article introduces Positive-Only Policy Optimization (POPO), a new reinforcement learning method that trains the reasoning capabilities of large language models without negative samples and outperforms GRPO by 6.67 percentage points on AIME 2025.

Tags: Reinforcement Learning · Large Language Models · Reasoning Capability · GRPO · Positive-Sample Optimization · RLVR · Qwen · Mathematical Reasoning
Published 2026-05-08 01:55 · Recent activity 2026-05-08 12:17 · Estimated read 6 min

Section 01

[Introduction] POPO: A New Paradigm for Enhancing Large Model Reasoning Capabilities Without Negative Samples

This article introduces Positive-Only Policy Optimization (POPO), a new reinforcement learning method that trains the reasoning capabilities of large language models without negative samples. POPO targets a weakness of GRPO: under a sparse binary reward, negative samples cannot express how severe a failure is, so every incorrect answer is penalized in the same way. By optimizing with positive samples only, POPO outperforms GRPO by 6.67 percentage points on the AIME 2025 benchmark.


Section 02

Background: Evolution and Limitations from PPO to GRPO

In recent years, Reinforcement Learning with Verifiable Rewards (RLVR) has become the mainstream paradigm for enhancing the reasoning capabilities of large models. Group Relative Policy Optimization (GRPO) simplified the advantage estimation mechanism and made progress on mathematical reasoning tasks, but it has a fundamental limitation: with a sparse binary reward, negative samples cannot express how severe a failure is, so the reward signal is too coarse for the model to learn fine-grained directions of improvement.
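
For reference, here is a minimal sketch of the group-relative advantage that GRPO computes within each group of sampled responses. It illustrates the standard formulation described above and is not code from the POPO paper:

```python
import numpy as np

def grpo_group_advantages(rewards):
    """Group-relative advantage: normalize each sampled response's reward
    by the mean and standard deviation of its group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:
        # All responses in the group scored the same -> no learning signal.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# With a binary verifiable reward, every incorrect response receives the same
# negative advantage, regardless of how close it came to a correct answer.
print(grpo_group_advantages([1, 0, 0, 0]))  # -> [ 1.73, -0.58, -0.58, -0.58]
```

Under a binary verifiable reward, every incorrect response in a group gets the same negative advantage, which is exactly the coarseness POPO sets out to avoid.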


Section 03

Core of POPO: Positive Sample Optimization with Complete Abandonment of Negative Samples

The core idea of POPO is to perform policy optimization entirely with online positive samples, without explicitly using negative samples. It adopts a bounded importance-sampling technique, and the key insight is that implicit negative gradients emerge naturally from the reallocation of probability mass: when the probability of generating positive samples is reinforced, the relative probabilities of all other samples (including negative ones) necessarily decrease. This acts as an implicit gradient penalty while avoiding the noise and instability that explicit negative samples introduce.
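
As an illustration of how such an objective might look, here is a minimal sketch assuming a PyTorch setup; the function name, the ratio bound, and the exact form of the surrogate are our assumptions based on the description above, not the authors' released code:

```python
import torch

def popo_positive_only_loss(logp_new, logp_old, rewards, ratio_bound=2.0):
    """Positive-only surrogate loss (illustrative sketch).

    logp_new, logp_old: per-response log-probabilities under the current and
        behavior policies, shape [num_samples].
    rewards: verifiable rewards, 1.0 for a correct answer, 0.0 otherwise.
    ratio_bound: upper bound on the importance-sampling ratio (assumed value).
    """
    positive = rewards > 0
    if not positive.any():
        # No verified-correct samples in this batch: contribute nothing,
        # but keep the computation graph intact.
        return logp_new.sum() * 0.0

    ratio = torch.exp(logp_new - logp_old)       # importance-sampling ratio
    ratio = torch.clamp(ratio, max=ratio_bound)  # bounded importance sampling
    # Maximize the (bounded) weighted likelihood of positive samples only.
    # Because the policy is a normalized distribution, pushing probability
    # mass toward positive samples implicitly pushes it away from all other
    # responses, including the incorrect ones.
    return -ratio[positive].mean()
```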


Section 04

Stabilization Mechanisms: Twin Networks and Bounded Similarity Penalty

To improve training stability, POPO introduces two innovations:

  1. Twin Policy Networks: two policy networks with a shared parameterization, where the main network updates quickly and the twin network follows it with momentum smoothing to stabilize policy evolution;
  2. Bounded Similarity Penalty: replaces the KL-divergence constraint by measuring the similarity of the policy distributions in the twin network's representation space, which is more efficient and more stable. A minimal sketch of both mechanisms follows this list.
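
The sketch below shows one way these two mechanisms could be realized in PyTorch; the momentum value, the use of cosine similarity as the bounded penalty, and the function names are our assumptions drawn from the description above, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_twin(main_net, twin_net, momentum=0.99):
    """Momentum smoothing: the twin policy slowly tracks the main policy
    (the momentum value is an assumption, not taken from the paper)."""
    for p_main, p_twin in zip(main_net.parameters(), twin_net.parameters()):
        p_twin.mul_(momentum).add_(p_main, alpha=1.0 - momentum)

def bounded_similarity_penalty(main_logits, twin_logits):
    """A bounded stand-in for the KL constraint: penalize dissimilarity between
    the main and twin policy distributions via cosine similarity, which is
    naturally bounded (again, an illustrative reading of the article)."""
    p_main = F.softmax(main_logits, dim=-1)
    p_twin = F.softmax(twin_logits, dim=-1)
    return (1.0 - F.cosine_similarity(p_main, p_twin, dim=-1)).mean()
```

In this reading, the twin network serves as a slowly moving reference policy, and the bounded penalty replaces an unbounded KL term that can spike when the two distributions drift apart.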

Section 05

Experimental Evidence: POPO Outperforms GRPO Across the Board

Experimental results on Qwen series models are significant:

| Model        | Method | AIME 2025 |
|--------------|--------|-----------|
| Qwen-Math-7B | GRPO   | 30.00%    |
| Qwen-Math-7B | POPO   | 36.67%    |
POPO improves over GRPO by 6.67 percentage points on AIME 2025, and ablation experiments show that the twin network and the bounded similarity penalty are both necessary stabilization measures.

Section 06

Technical Significance and Future Outlook

  1. Theoretical aspect: challenges the assumption in the RL field that negative samples must be handled explicitly, and motivates further research on sample efficiency;
  2. Practical aspect: simplifies the RLVR training pipeline, reduces inference computation overhead by 50%, removes the need for negative-sample selection rules, and shrinks the hyperparameter-tuning space;
  3. Future outlook: extend POPO to tasks such as code generation and logical reasoning, and explore combining it with test-time compute scaling to increase reasoning depth.


Section 07

Conclusion: The Value and Impact of POPO

POPO is an important advancement in the post-training field of large language models. It achieves reinforcement learning without negative samples through probability distribution normalization constraints, maintaining stability while outperforming existing methods. It not only provides a plug-and-play training improvement solution but also offers a new perspective for understanding the essential mechanism of RLVR.