# DRPO: Rethinking Divergence Regularization in LLM Reinforcement Learning

> DRPO replaces hard masks with a smooth advantage-weighted quadratic regularizer, maintaining the trust region geometry while providing continuous gradient weights, significantly improving the stability and efficiency of reinforcement learning training for large language models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-08T17:58:23.000Z
- Last activity: 2026-06-09T04:51:03.795Z
- Popularity: 129.1
- Keywords: reinforcement learning, PPO, trust region, policy optimization, RLHF, model alignment, gradient regularization
- Page link: https://www.zingnex.cn/en/forum/thread/drpo-llm
- Canonical: https://www.zingnex.cn/forum/thread/drpo-llm
- Markdown source: floors_fallback

---

## Introduction: Rethinking Divergence Regularization in LLM Reinforcement Learning

**Key Highlights of DRPO**
DRPO (Divergence Regularized Policy Optimization) targets trust region control in LLM reinforcement learning. It replaces hard masks with a smooth advantage-weighted quadratic regularizer that keeps the trust region geometry while giving every token a continuous gradient weight, which is reported to improve training stability and efficiency. This article covers the background, the method, and the experimental validation.

## Challenges of LLM Reinforcement Learning and Limitations of Existing Methods

Reinforcement learning (RL) is a core component of LLM post-training, used for instruction following, safety alignment, and similar goals. Off-policy training, however, means the policy being updated drifts away from the policy that generated the samples, and this distribution mismatch makes trust region control crucial.
Existing methods such as PPO approximate the trust region by clipping the probability ratio, but on long-tailed vocabularies the ratio does not accurately reflect the actual distribution shift. DPPO replaces clipping with divergence masks, yet those masks are hard: the gradient of any out-of-bound token is discarded entirely, which can easily cause training problems.
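
To make the long-tail issue concrete, here is a minimal sketch in plain Python. `ppo_gradient_scale` follows the standard PPO clipped surrogate. Everything else is an illustrative assumption rather than DPPO's actual definition: the function `hard_mask_gradient_scale`, the threshold `delta`, the use of the absolute probability shift as a divergence proxy, and the example probabilities.

```python
def ppo_gradient_scale(ratio, advantage, eps=0.2):
    """Scale on grad log pi for one token under PPO's clipped surrogate.

    The surrogate is min(r * A, clip(r, 1 - eps, 1 + eps) * A). Its gradient
    vanishes once the ratio has passed the clip bound in the direction the
    advantage favours.
    """
    clipped_out = (advantage > 0 and ratio > 1 + eps) or (
        advantage < 0 and ratio < 1 - eps
    )
    return 0.0 if clipped_out else ratio * advantage


def hard_mask_gradient_scale(ratio, advantage, divergence, delta):
    """Hypothetical hard divergence mask: drop the token entirely past delta."""
    return ratio * advantage if divergence <= delta else 0.0


# (old probability, new probability) for a rare and a common token (made up).
tokens = {"rare": (0.001, 0.002), "common": (0.5, 0.55)}

for name, (p_old, p_new) in tokens.items():
    ratio = p_new / p_old
    shift = abs(p_new - p_old)  # assumed stand-in for a per-token divergence
    ppo = ppo_gradient_scale(ratio, advantage=1.0)
    masked = hard_mask_gradient_scale(ratio, 1.0, shift, delta=0.01)
    print(f"{name}: ratio={ratio:.2f} shift={shift:.3f} ppo={ppo:.2f} mask={masked:.2f}")
```

In this toy case the rare token's ratio doubles although its probability mass barely moves, so ratio clipping zeroes its gradient, while the common token moves far more mass and stays inside the clip range. The hard mask makes the opposite call, but in both schemes a token is either fully kept or fully dropped.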

## Core Innovation of DRPO: Smooth Regularization Replaces Hard Masks

The key change in DRPO is to replace hard masks with a **smooth advantage-weighted quadratic regularizer**, which:
1. Maintains the same trust region geometry as DPPO, preventing excessive policy deviation;
2. Produces bounded, continuous gradient weights that attenuate divergent updates while still providing a correction signal;
3. Avoids the all-or-nothing decisions of hard masks, improving training stability.

## Technical Details of DRPO: Mathematical Design of Soft Regularization

DRPO penalizes policy deviation with a quadratic regularization term. Advantage weighting scales the penalty so that only tokens that affect target performance are strictly constrained. Unlike a hard mask, the soft regularizer lets out-of-bound tokens keep contributing gradients with attenuated weights and adds a correction signal that pulls them back toward the trust region, which helps avoid getting stuck in local optima early in training.
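
This post does not give DRPO's exact objective, so the sketch below uses one hypothetical quadratic penalty chosen only to show the qualitative behaviour described above. It assumes a per-token objective `J(r) = A * (r - 1) - |A| / (2 * delta) * (r - 1) ** 2`, where the ratio deviation `r - 1` stands in for the divergence measure; `delta`, `bound`, and both function names are illustrative.

```python
def soft_gradient_weight(ratio, advantage, delta=0.2, bound=1.0):
    """Gradient weight of a hypothetical advantage-weighted quadratic penalty.

    With J(r) = A * (r - 1) - |A| / (2 * delta) * (r - 1) ** 2 we get
    dJ/dr = A * w(r), where w(r) = 1 - sign(A) * (r - 1) / delta.
    This is an illustration, not the DRPO formula.
    """
    sign = 1.0 if advantage >= 0 else -1.0
    weight = 1.0 - sign * (ratio - 1.0) / delta
    # Clamp so that far-out tokens cannot produce unbounded corrections.
    return max(-bound, min(bound, weight))


def hard_mask_weight(ratio, advantage, delta=0.2):
    """Hard mask with the same boundary: weight 1 inside, 0 outside."""
    sign = 1.0 if advantage >= 0 else -1.0
    return 1.0 if sign * (ratio - 1.0) <= delta else 0.0


# Compare the two weights as the ratio moves past the boundary at 1 + delta.
for ratio in (1.0, 1.1, 1.2, 1.3, 1.5):
    soft = soft_gradient_weight(ratio, advantage=1.0)
    hard = hard_mask_weight(ratio, advantage=1.0)
    print(f"ratio={ratio:.1f} soft={soft:+.2f} hard={hard:.0f}")
```

Under these assumptions the soft weight shrinks continuously toward the boundary and crosses zero exactly where the hard mask would switch off, so the trust region sits in the same place. Past the boundary the weight turns negative, which is the correction signal: the update now pushes the token back instead of ignoring it.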

## Experimental Validation: Improved Stability and Efficiency Across Scales

Experiments cover different model scales, architectures, and precision settings. The reported results show:
- Lower training variance and smoother learning curves;
- Fewer training steps to reach target performance;
- A simple design that is easy to integrate into existing RLHF and inference optimization pipelines.

## Practical Significance and Recommendations for DRPO

DRPO's results suggest that, for trust region control, a smooth penalty can work better than a hard constraint. Recommendations for LLM post-training practitioners:
- Try DRPO on your next training run;
- Its idea of replacing discrete masks with continuous regularization may inspire improvements to other algorithms.
