Section 01
Overview / Original Post: Robust Reasoning Under Noisy Supervision: Online Label Refinement Enables LLMs to Self-Correct in Mislabeled Data
This paper systematically analyzes how noisy labels affect RLVR (reinforcement learning with verifiable rewards) training and proposes OLR, an online label refinement method that gradually corrects mislabeled examples through majority voting and dynamic consistency detection. It reports significantly improved model robustness, even at noise ratios as high as 90%.
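To make the idea of refining labels by majority voting more concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the `min_share` vote threshold, and the `patience` rule (the same confident majority answer must recur over consecutive rounds, used here as a stand-in for "dynamic consistency detection") are all assumptions made for illustration.

```python
from collections import Counter


def majority_answer(answers):
    """Return (most common answer, its vote share) for one prompt's sampled answers."""
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def refine_label(label, answers, history, min_share=0.6, patience=2):
    """Hypothetical refinement rule (an assumption, not the paper's exact procedure).

    label    -- the current, possibly wrong, label for this prompt
    answers  -- final answers extracted from the model's sampled responses
    history  -- list of past confident majority answers, mutated in place
    Returns the label to use for the next training round.
    """
    answer, share = majority_answer(answers)
    # Record only confident majorities; a weak vote is stored as None,
    # which breaks any run of agreeing rounds.
    history.append(answer if share >= min_share else None)
    recent = history[-patience:]
    stable = (
        len(recent) == patience
        and recent[0] is not None
        and all(a == recent[0] for a in recent)
    )
    # Overwrite the label only once the majority has been stable long enough.
    return recent[0] if stable else label
```

Under these assumptions, a label is replaced only after the model's sampled answers agree strongly and repeatedly, so a single lucky or unlucky batch of samples does not flip it.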