Section 01
[Introduction] AsymGRPO: Rethinking Exploration Mechanisms in RLVR—From Entropy Regularization to Bidirectional Entropy Modulation
This article introduces the AsymGRPO framework, which decomposes policy entropy into "informational entropy" (beneficial uncertainty) and "spurious entropy" (unhelpful noise), enabling differential entropy modulation of positive and negative samples. This design addresses the limited exploration of large language models under Reinforcement Learning with Verifiable Rewards (RLVR), improving both their reasoning ability and their generalization performance.
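The core idea above — group-relative advantages plus opposite-signed entropy treatment for positive and negative samples — can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the function names, the coefficients `beta_pos`/`beta_neg`, and the multiplicative modulation rule are all assumptions chosen to convey the intuition that high entropy is encouraged on correct (positive-advantage) samples and penalized on incorrect (negative-advantage) ones.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style:
    normalize each sample's reward against its group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def asymmetric_entropy_modulation(advantages, entropies,
                                  beta_pos=0.1, beta_neg=0.05):
    """Hypothetical bidirectional modulation (illustrative only):
    - positive samples: higher entropy -> larger boost, treating their
      uncertainty as "informational" and worth preserving;
    - negative samples: higher entropy -> larger penalty, treating their
      uncertainty as "spurious" noise to be suppressed."""
    adv = np.asarray(advantages, dtype=float)
    h = np.asarray(entropies, dtype=float)
    out = adv.copy()
    pos = adv > 0
    out[pos] *= 1.0 + beta_pos * h[pos]    # amplify exploratory correct samples
    out[~pos] *= 1.0 + beta_neg * h[~pos]  # deepen penalty on noisy wrong samples
    return out

# Toy group: two correct (reward 1) and two incorrect (reward 0) rollouts,
# with per-sample mean token entropies.
adv = grpo_advantages([1, 0, 1, 0])
mod = asymmetric_entropy_modulation(adv, entropies=[2.0, 2.0, 0.5, 0.5])
```

In this toy run, the high-entropy correct sample receives a larger positive advantage than the low-entropy correct one, while the high-entropy incorrect sample receives a more negative advantage than the low-entropy incorrect one, which is one way to realize "bidirectional" modulation instead of a single uniform entropy bonus.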