Section 01
[Overview] Robust Reasoning Under Noisy Supervision: OLR Method Enables LLMs to Self-Correct in Mislabeled Scenarios
This paper addresses the noisy-label problem in Reinforcement Learning with Verifiable Rewards (RLVR) training. It systematically analyzes the mechanism by which noisy labels affect training and proposes Online Label Refinement (OLR), a method that gradually corrects mislabeled samples through majority voting and dynamic consistency detection. The paper reports that OLR significantly improves model robustness even at noise ratios as high as 90%.
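To make the majority-voting idea concrete, here is a minimal sketch of how a possibly wrong label might be refined from a model's own sampled answers. It is not the paper's implementation: the function names `majority_vote` and `refine_label`, the fixed `threshold` parameter, and the use of a simple vote share as a stand-in for consistency detection are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and its share of all votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def refine_label(current_label, answers, threshold=0.6):
    """Replace the label only when the sampled answers agree strongly.

    Assumption: `threshold` is a hypothetical fixed cutoff. The overview
    describes the consistency check as dynamic, which this sketch does
    not attempt to model.
    """
    answer, share = majority_vote(answers)
    if share >= threshold:
        return answer      # rollouts are consistent: trust the majority
    return current_label   # rollouts disagree: keep the existing label


# Four of five rollouts agree, so the given label "7" is overridden.
refined = refine_label("7", ["42", "42", "42", "42", "7"])

# No answer dominates, so the given label is kept.
kept = refine_label("7", ["1", "2", "3", "4"])
```

The design choice worth noting is the guard on agreement: a label is overwritten only when the sampled answers are sufficiently consistent, which is one plausible way to avoid replacing a correct label with a confident-looking but scattered guess.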