Zing Forum


Robust Reasoning Under Noisy Supervision: Online Label Refinement Enables LLMs to Self-Correct in Mislabeled Scenarios

This paper systematically analyzes how noisy labels affect RLVR training and proposes the Online Label Refinement (OLR) method, which progressively corrects mislabeled samples through majority voting and dynamic consistency detection. It reports significantly improved model robustness even at noise ratios as high as 90%.

Reinforcement learning, noisy labels, reasoning models, label refinement, robustness, self-correction
Published 2026-04-05 14:30 | Recent activity 2026-04-07 15:36 | Estimated read 5 min

Section 01

[Overview] Robust Reasoning Under Noisy Supervision: OLR Method Enables LLMs to Self-Correct in Mislabeled Scenarios

This paper addresses the noisy label problem in Reinforcement Learning with Verifiable Rewards (RLVR) training. It systematically analyzes how label noise affects training and proposes the Online Label Refinement (OLR) method, which progressively corrects mislabeled samples through majority voting and dynamic consistency detection. The reported result is significantly improved model robustness, even at noise ratios as high as 90%.


Section 02

Background: Dilemmas and Classification of Noisy Labels in RLVR

RLVR is an effective paradigm for training reasoning models: a verifier checks whether each solution is correct and supplies the reward, which avoids expensive manual annotation. Existing studies, however, assume the verifier's labels are perfect, whereas noisy labels are inevitable in practice. The paper classifies noisy labels into two categories. Inactive noisy labels are those for which the current policy cannot generate a matching solution, which reduces data efficiency. Active noisy labels are those the policy can match, which readily pulls the model toward an incorrect distribution.
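The two categories can be illustrated with a small sketch. It is not taken from the paper: the function name `classify_label` is hypothetical, answers are compared by exact string match, and "the policy can generate a matching solution" is approximated as "at least one sampled rollout matches the label". The true answer is only available here because the example is a diagnostic; during real training it would be unknown.

```python
from typing import Sequence


def classify_label(
    rollout_answers: Sequence[str], label: str, true_answer: str
) -> str:
    """Return 'clean', 'active_noisy', or 'inactive_noisy' for one sample.

    Hypothetical helper for illustration only.
    """
    if label == true_answer:
        return "clean"
    # Assumption: a noisy label counts as "active" when at least one sampled
    # rollout reproduces it, so the verifier would reward a wrong solution.
    if any(answer == label for answer in rollout_answers):
        return "active_noisy"
    # No rollout matches the wrong label, so the sample yields no reward
    # signal at all and only costs data efficiency.
    return "inactive_noisy"
```

For example, with rollouts `["12", "15", "12"]` and true answer `"12"`, a label of `"15"` would be active noise and a label of `"99"` would be inactive noise.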


Section 03

Method: Core Mechanism of Online Label Refinement (OLR)

The core idea of OLR is to use the model's own outputs to identify and correct mislabeled samples, without additional annotation resources. A label is corrected only when two conditions hold: 1. The pass rate of the majority answer shows a positive slope, meaning the model is converging toward a consistent solution; 2. Historical consistency is stable, meaning the model is confident in its predictions for that sample. When both conditions are met, the original label is replaced with the majority-voted answer, which yields progressive self-correction over the course of training.
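The two conditions above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `majority_answer`, `slope`, and `refine_label`, the window size `window`, the least-squares slope estimate, and the reading of "stable historical consistency" as "the same majority answer at every step in the window" are all choices made for this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_answer(answers: Sequence[str]) -> str:
    """Most frequent answer among the rollouts of one training step."""
    return Counter(answers).most_common(1)[0][0]


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against step indices 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def refine_label(label: str, history: List[List[str]], window: int = 3) -> str:
    """Return the possibly corrected label for one sample.

    history holds the rollout answers of each past step, oldest first.
    """
    if len(history) < window:
        return label  # not enough evidence yet
    recent = history[-window:]
    candidate = majority_answer(recent[-1])
    # Condition 1 (assumed form): the share of rollouts agreeing with the
    # current majority answer trends upward over the window.
    pass_rates = [sum(a == candidate for a in step) / len(step) for step in recent]
    rising = slope(pass_rates) > 0
    # Condition 2 (assumed form): the majority answer did not change
    # anywhere in the window.
    stable = all(majority_answer(step) == candidate for step in recent)
    return candidate if rising and stable else label
```

With a history such as `[["7", "7", "3", "5"], ["7", "7", "7", "5"], ["7", "7", "7", "7"]]`, the agreement rate rises from 0.5 to 1.0 while the majority stays `"7"`, so a label of `"3"` would be replaced by `"7"`. If the majority answer changes within the window, the original label is kept.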


Section 04

Experimental Validation: Robustness Performance of OLR Under High Noise

Experiments were conducted on 6 in-distribution tasks (including AIME 2024/2025 and AMC) and 3 out-of-distribution tasks (including ARC-c and GPQA-diamond), with noise ratios ranging from 0.1 to 0.9. The results show an average improvement of 3.6% to 3.9% in-distribution and 3.3% to 4.6% out-of-distribution. Improvements persist even at a 90% noise ratio, which indicates that OLR is strongly robust to label noise.
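The summary does not say how the noisy labels were constructed. As a rough illustration of what a noise ratio means in such a setup, the sketch below corrupts a fixed fraction of labels in a dataset. The function `inject_label_noise`, the caller-supplied `wrong_label_fn`, and the seeded sampling are assumptions made for this example only.

```python
import random
from typing import Callable, List, Sequence


def inject_label_noise(
    labels: Sequence[str],
    noise_ratio: float,
    wrong_label_fn: Callable[[str], str],
    seed: int = 0,
) -> List[str]:
    """Replace a noise_ratio fraction of labels with wrong ones.

    wrong_label_fn is a hypothetical callback that maps a correct label
    to an incorrect one; it must never return its input unchanged.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be in [0, 1]")
    rng = random.Random(seed)  # local generator keeps the split reproducible
    n_noisy = round(noise_ratio * len(labels))
    noisy_indices = set(rng.sample(range(len(labels)), n_noisy))
    return [
        wrong_label_fn(label) if i in noisy_indices else label
        for i, label in enumerate(labels)
    ]
```

At a ratio of 0.9, nine out of every ten training labels would be wrong under this scheme, which is the regime in which the summary reports that OLR still helps.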


Section 05

Conclusions and Practical Implications

The core contributions are: a systematic analysis and classification of how noisy labels act in RLVR; the discovery of the early correctness consistency phenomenon; the proposal of the OLR method; and experimental validation of its effectiveness. The practical implications are that RLVR pipelines should assume labels contain noise, that intervening early on noisy labels is more effective, and that self-supervision can improve training quality.


Section 06

Limitations and Future Research Directions

Current limitations: dependence on verifiers (limiting open-domain applications), computational overhead, and insufficient theoretical understanding. Future directions: expand to open-domain tasks; explore multi-agent collaborative label refinement; develop adaptive correction thresholds; strengthen theoretical analysis to optimize the method.