Zing Forum

DRPO: Rethinking Divergence Regularization in LLM Reinforcement Learning

DRPO replaces hard masks with a smooth advantage-weighted quadratic regularizer, maintaining the trust region geometry while providing continuous gradient weights, significantly improving the stability and efficiency of reinforcement learning training for large language models.

Reinforcement Learning, PPO, Trust Region, Policy Optimization, RLHF, Model Alignment, Gradient Regularization
Published 2026-06-09 01:58 | Recent activity 2026-06-09 12:51 | Estimated read 5 min

Section 01

DRPO: Introduction to Rethinking Divergence Regularization in LLM Reinforcement Learning

Key highlights: DRPO (Divergence Regularized Policy Optimization) addresses trust region control in LLM reinforcement learning by replacing hard masks with a smooth, advantage-weighted quadratic regularizer. It preserves the trust region geometry while providing continuous gradient weights, which improves training stability and efficiency. This article covers the background, the method, and the experimental validation.

Section 02

Challenges of LLM Reinforcement Learning and Limitations of Existing Methods

Reinforcement learning (RL) is a core component of LLM post-training, used for instruction following, safety alignment, and similar goals. Because training is off-policy, the policy that generated the samples and the policy being updated drift apart, so trust region control becomes crucial. PPO approximates the trust region by clipping the probability ratio, but on long-tailed vocabularies the ratio does not accurately reflect the actual distribution shift. DPPO replaces clipping with divergence masks, but the masks are hard: gradients of out-of-bound tokens are discarded entirely, which easily leads to training issues.
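The mismatch between ratio clipping and actual distribution shift can be illustrated with a small sketch. Everything below is an assumption for illustration: the threshold values, the helper names, and the use of the absolute probability change of the sampled token as a stand-in for a per-token divergence. It is not the exact rule used by PPO implementations or by DPPO.

```python
# Illustrative sketch only. The thresholds (eps, delta) and the
# absolute-probability "divergence" proxy are assumptions, not values
# or definitions taken from the PPO or DPPO papers.

def ppo_gradient_passes(ratio, advantage, eps=0.2):
    """Return True if the clipped surrogate min(r*A, clip(r)*A) still has a
    nonzero gradient with respect to the ratio (assuming A != 0)."""
    if advantage >= 0:
        return ratio <= 1.0 + eps   # clipped once r exceeds 1 + eps
    return ratio >= 1.0 - eps       # clipped once r drops below 1 - eps


def hard_mask_weight(divergence, delta=0.05):
    """Hard mask: full gradient inside the trust region, none outside."""
    return 1.0 if divergence <= delta else 0.0


def describe(p_old, p_new, advantage):
    """Compare both rules for one sampled token."""
    ratio = p_new / p_old
    shift = abs(p_new - p_old)      # assumed per-token divergence proxy
    return {
        "ratio": ratio,
        "shift": shift,
        "ppo_passes": ppo_gradient_passes(ratio, advantage),
        "mask_weight": hard_mask_weight(shift),
    }


# A rare token: the ratio triples, yet almost no probability mass moves.
rare = describe(p_old=1e-4, p_new=3e-4, advantage=1.0)
# A common token: the ratio stays inside [0.8, 1.2], yet 0.15 of mass moves.
common = describe(p_old=0.9, p_new=0.75, advantage=-1.0)

print("rare:  ", rare)
print("common:", common)
```

In this toy case the ratio rule clips the rare token even though the distribution barely changed, and lets the common token through even though a large amount of probability mass moved. The divergence-based hard mask makes the opposite call on both, but its weight is either 1 or 0 with nothing in between.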

Section 03

Core Innovation of DRPO: Smooth Regularization Replaces Hard Masks

The key improvement of DRPO is replacing hard masks with a smooth advantage-weighted quadratic regularizer:

  1. It maintains the same trust region geometry as DPPO, preventing excessive policy deviation;
  2. It produces bounded, continuous gradient weights that attenuate divergent updates while still providing correction signals;
  3. It avoids the all-or-nothing decisions of hard masks, improving training stability.

Section 04

Technical Details of DRPO: Mathematical Design of Soft Regularization

DRPO penalizes policy deviation with a quadratic regularization term, and the advantage weighting ensures that only tokens that affect the target performance are strictly constrained. Unlike a hard mask, the soft regularizer lets out-of-bound tokens keep contributing gradients with attenuated weights, and it supplies a correction signal that pulls the policy back toward the trust region. This helps avoid getting stuck in local optima early in training.
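The article does not give DRPO's formula, so the following is only a minimal sketch of how an advantage-weighted quadratic penalty could yield continuous gradient weights. The surrogate `A*x - |A|/(2*delta) * x**2`, the deviation `x = p_new - p_old`, the threshold `delta`, and all function names are assumptions chosen for illustration and should not be read as the paper's actual objective.

```python
# Hypothetical sketch, NOT the DRPO objective from the paper.
# x     : assumed per-token deviation, p_new - p_old, so x lies in [-1, 1]
# delta : assumed trust region radius

def sign(advantage):
    return 1.0 if advantage >= 0 else -1.0


def hard_mask_weight(x, advantage, delta):
    """Hard mask: drop the gradient once the policy has moved more than
    delta in the direction the advantage favors."""
    return 1.0 if sign(advantage) * x <= delta else 0.0


def soft_surrogate(x, advantage, delta):
    """Linear policy-gradient term minus an advantage-weighted quadratic penalty."""
    return advantage * x - abs(advantage) / (2.0 * delta) * x * x


def soft_weight(x, advantage, delta):
    """Gradient weight of soft_surrogate: d/dx = advantage * soft_weight.
    It is 1 at x = 0, reaches 0 at the boundary sign(A) * x = delta, and turns
    negative beyond it, which acts as a pull back toward the trust region.
    Because |x| <= 1, the weight stays within [1 - 1/delta, 1 + 1/delta]."""
    return 1.0 - sign(advantage) * x / delta


delta = 0.1
for x in (0.0, 0.05, 0.1, 0.2):
    print(x, hard_mask_weight(x, 1.0, delta), round(soft_weight(x, 1.0, delta), 3))
```

Under these assumptions both rules share the same boundary (the weight first reaches zero at a deviation of `delta`), but the soft weight decays gradually inside the region and becomes a restoring signal outside it, whereas the hard mask jumps from 1 to 0 and then stays silent.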

Section 05

Experimental Validation: Improved Stability and Efficiency Across Scales

Experiments cover different model scales, architectures, and precision settings, and the results show:

  • Reduced training variance, with smoother learning curves;
  • Fewer training steps to reach target performance;
  • A simple design that is easy to integrate into existing RLHF and inference optimization pipelines.

Section 06

Practical Significance and Recommendations for DRPO

DRPO's results suggest that smooth regularization can work better than hard constraints for trust region control. Recommendations for LLM post-training practitioners:

  • Try applying DRPO to your next training task;
  • Its concept of "continuous regularization replacing discrete masks" can inspire improvements in other algorithms.