Section 01
[Introduction] Entropy-Gradient Reversal: New Discoveries and Optimization Methods for the Internal Mechanisms of Large Reasoning Models
A paper published in May 2026 reports a robust negative correlation between token entropy and logit gradients in Large Reasoning Models (LRMs). The authors call this pattern "Entropy-Gradient Reversal" and characterize it as a geometric fingerprint of reasoning ability. Building on this finding, the research team proposes CorR-PO, a method that embeds the reversal feature into the reward regularization of Reinforcement Learning (RL). According to the paper, CorR-PO outperforms existing methods on multiple reasoning benchmarks and improves training stability.
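To make the quantities in that correlation concrete, here is a minimal sketch of how a per-token entropy and a per-token logit-gradient magnitude could be computed and correlated. It is an illustration under stated assumptions, not the paper's measurement: this summary does not say how the paper defines the "logit gradient", so the sketch assumes the gradient of the negative log-likelihood of the chosen token with respect to the logits, which for a softmax is `p - onehot(y)`. All function names are hypothetical, and the random logits stand in for real model outputs.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each token's next-token distribution.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def logit_grad_norm(probs, token_ids):
    # ASSUMPTION: "logit gradient" is taken to be d(-log p[y]) / d(logits),
    # which equals p - onehot(y). The paper may define it differently.
    grad = probs.copy()
    grad[np.arange(len(token_ids)), token_ids] -= 1.0
    return np.linalg.norm(grad, axis=-1)

def entropy_gradient_correlation(logits, token_ids):
    # Pearson correlation between per-token entropy and gradient norm.
    probs = softmax(logits)
    h = token_entropy(probs)
    g = logit_grad_norm(probs, token_ids)
    return float(np.corrcoef(h, g)[0, 1])

# Synthetic stand-in for model outputs: 256 positions, vocabulary of 50.
rng = np.random.default_rng(0)
demo_logits = rng.normal(scale=3.0, size=(256, 50))
demo_tokens = rng.integers(0, 50, size=256)
demo_r = entropy_gradient_correlation(demo_logits, demo_tokens)
print(f"Pearson r between entropy and gradient norm: {demo_r:.3f}")
```

On real data, `logits` would be the model's pre-softmax outputs at each generated position and `token_ids` the tokens actually produced. The sign of the coefficient on synthetic logits says nothing about LRMs; the sketch only shows what is being correlated.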