[Introduction] CAL-GRPO: Calibrated Reinforcement Learning for Large Models to Learn by Trial and Error
CAL-GRPO addresses the gradient bias problem in multi-turn chain-of-thought reasoning with an attempt-level calibration strategy, letting models accumulate experience across attempts and improve incrementally, which significantly strengthens their ability to solve complex tasks. This article examines how to equip large language models with iterative-improvement capability: the model makes up to K consecutive attempts, each one building a better solution from previous failures and from feedback supplied by a hard verifier.
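The multi-attempt loop and the calibration idea can be sketched in code. This is a minimal illustration, not CAL-GRPO's actual implementation: `multi_attempt_rollout`, `solve`, `verify`, and `attempt_calibrated_advantages` are hypothetical names, and normalizing rewards within each attempt index across a group of rollouts is one plausible reading of "attempt-level calibration."

```python
def multi_attempt_rollout(solve, verify, problem, max_attempts=4):
    """Make up to max_attempts tries; each try sees the prior failed attempts."""
    history = []  # list of (attempt, passed) pairs
    for _ in range(max_attempts):
        attempt = solve(problem, history)  # conditions on earlier failures
        passed = verify(problem, attempt)  # hard verifier: True / False
        history.append((attempt, passed))
        if passed:
            break
    return history

def attempt_calibrated_advantages(rewards_by_attempt):
    """Normalize rewards within each attempt index across a group of rollouts,
    so a later attempt is compared against other k-th attempts rather than
    against first tries (an assumed form of attempt-level calibration)."""
    advantages = {}
    for k, rewards in rewards_by_attempt.items():
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = var ** 0.5 or 1.0  # avoid division by zero on identical rewards
        advantages[k] = [(r - mean) / std for r in rewards]
    return advantages
```

For example, with group rewards `{0: [1.0, 0.0], 1: [1.0, 1.0]}`, the first-attempt rewards are normalized against each other, while the uniform second-attempt rewards yield zero advantage rather than an inflated one.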