Section 01
DRPO: Introduction to Rethinking Divergence Regularization in LLM Reinforcement Learning
Key Highlights of DRPO
DRPO (Divergence Regularized Policy Optimization) addresses the trust-region control problem in LLM reinforcement learning by replacing hard masks with a smooth, advantage-weighted quadratic regularizer. The regularizer preserves the trust-region geometry while providing continuous gradient weights, significantly improving training stability and efficiency. This article examines DRPO in terms of its background, methodology, and experimental validation.