Section 01
[Introduction] ERPO: Token-level Entropy Regulation Optimizes Reasoning Capabilities of Large-scale Reasoning Models
This article introduces ERPO (Entropy Regulation Policy Optimization), a new method for improving the training of large-scale reasoning models. ERPO targets the premature entropy collapse caused by uniform advantage allocation in GRPO. It does so by identifying Critical Decision Points (CDPs) and introducing three synergistic mechanisms, and it achieves higher accuracy and more concise reasoning paths on mathematical reasoning benchmarks.
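
To make the two quantities in this summary concrete, here is a minimal sketch of what "uniform advantage allocation" and "token-level entropy" can look like. It assumes a common formulation of GRPO (Group Relative Policy Optimization) in which each sampled response gets one group-normalized advantage that is copied to all of its tokens. The function names, rewards, and probability vectors are hypothetical illustrations, not ERPO's implementation, and the sketch does not attempt to reproduce how ERPO defines CDPs or its three mechanisms.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: one scalar per sampled response (assumed formulation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Uniform allocation: every token in a response receives that response's advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical group of four responses to one prompt (1.0 = correct, 0.0 = wrong).
rewards = [1.0, 0.0, 1.0, 0.0]
lengths = [3, 2, 4, 3]  # token counts per response, made up for illustration
token_adv = broadcast_to_tokens(grpo_advantages(rewards), lengths)

# Two hypothetical next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # the model is nearly certain
uncertain = [0.25, 0.25, 0.25, 0.25]  # the model is maximally undecided

print(token_adv[0])
print(token_entropy(confident), token_entropy(uncertain))
```

Under this allocation, a token where the model is nearly certain and a token where it is genuinely undecided receive exactly the same learning signal, even though their entropies differ widely. That mismatch is the gap a token-level entropy signal is meant to expose.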