Robust Reasoning Under Noisy Supervision: Online Label Refinement Enables LLMs to Self-Correct in Mislabeled Scenarios

This paper systematically analyzes how noisy labels affect Reinforcement Learning with Verifiable Rewards (RLVR) training and proposes Online Label Refinement (OLR), a method that progressively corrects mislabeled samples through majority voting and dynamic consistency detection. OLR significantly improves model robustness even at noise ratios as high as 90%.

Reinforcement learning, noisy labels, reasoning models, label refinement, robustness, self-correction

Published 2026-04-05 14:30 | Recent activity 2026-04-07 15:36 | Estimated read 5 min

Section 01

[Overview] Robust Reasoning Under Noisy Supervision: OLR Method Enables LLMs to Self-Correct in Mislabeled Scenarios

This paper addresses the noisy label problem in Reinforcement Learning with Verifiable Rewards (RLVR) training. It systematically analyzes how label noise affects training and proposes the Online Label Refinement (OLR) method, which progressively corrects mislabeled samples through majority voting and dynamic consistency detection. OLR significantly improves model robustness even at noise ratios as high as 90%, offering a robust approach to RLVR training.

Section 02

Background: Dilemmas and Classification of Noisy Labels in RLVR

RLVR is an effective paradigm for training reasoning models: a verifier checks the correctness of each solution and supplies the reward, which avoids expensive manual annotation. Existing studies, however, assume that the verifier's labels are perfect, whereas noisy labels are inevitable in practice. The paper divides noisy labels into two categories. Inactive noisy labels are those for which the current policy cannot generate a matching solution, so they reduce data efficiency. Active noisy labels are those the policy can match, so they tend to shift the model toward an incorrect distribution.
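The two categories can be illustrated with a minimal Python sketch. The function name `classify_label`, the exact-match comparison, and the availability of a ground-truth answer for diagnosis are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def classify_label(rollout_answers: Sequence[str],
                   given_label: str,
                   true_answer: str) -> str:
    """Categorize one training label (hypothetical diagnostic helper).

    rollout_answers: final answers sampled from the current policy.
    given_label:     the (possibly wrong) label the verifier checks against.
    true_answer:     ground truth, assumed available only for analysis.
    """
    if given_label == true_answer:
        return "clean"
    # The label is wrong. It is "active" if the policy can already
    # produce an answer that matches it, and "inactive" otherwise.
    if any(answer == given_label for answer in rollout_answers):
        return "active"
    return "inactive"
```

Under this reading, an inactive noisy label never produces a positive reward, so the sample contributes little to learning, while an active noisy label rewards a wrong answer that the policy can already reach.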

Section 03

Method: Core Mechanism of Online Label Refinement (OLR)

The core idea of OLR is to use the model's own outputs to identify and correct mislabeled samples, with no additional annotation resources. A label is corrected only when two conditions hold: (1) the pass rate of the majority answer shows a positive slope, indicating that the model is converging toward a consistent solution; and (2) historical consistency is stable, indicating that the model is confident in its prediction for that sample. When both conditions are met, the original label is replaced with the majority-voted answer, which yields progressive self-correction.
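The correction rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the helper names, the `window` parameter, the least-squares slope estimate, and the reading of "stable historical consistency" as "the same majority answer across the recent window" are all hypothetical choices.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_answer(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of rollouts giving it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def refine_label(label: str, history: List[List[str]], window: int = 3) -> str:
    """Return the refined label for one sample.

    history holds the rollout answers from each past training step.
    The window size and both checks are assumptions for illustration.
    """
    if len(history) < window:
        return label  # not enough evidence yet
    recent = [majority_answer(step) for step in history[-window:]]
    answers = [answer for answer, _ in recent]
    rates = [rate for _, rate in recent]
    rising = slope(rates) > 0        # condition 1: positive slope
    stable = len(set(answers)) == 1  # condition 2: consistent majority
    return answers[-1] if (rising and stable) else label
```

For example, if the majority answer "7" is produced by 50%, then 75%, then 100% of rollouts over three steps, both checks pass and a given label of "3" would be replaced by "7". If the share of "7" is falling instead, the original label is kept.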

Section 04

Experimental Validation: Robustness Performance of OLR Under High Noise

Experiments cover 6 in-distribution tasks (e.g., AIME 2024/2025, AMC) and 3 out-of-distribution tasks (e.g., ARC-c, GPQA-diamond), with noise ratios ranging from 0.1 to 0.9. OLR yields average improvements of 3.6%-3.9% in-distribution and 3.3%-4.6% out-of-distribution. The gains persist even at a 90% noise ratio, which indicates strong robustness.
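The summary does not say how the noisy labels were constructed. As a rough illustration of what a given noise ratio means, the sketch below replaces a fixed fraction of labels with wrong answers; the function `corrupt_labels`, the `wrong_pool` argument, and the uniform sampling are assumptions, not the paper's protocol.

```python
import random
from typing import List


def corrupt_labels(labels: List[str],
                   noise_ratio: float,
                   wrong_pool: List[str],
                   seed: int = 0) -> List[str]:
    """Replace a fraction `noise_ratio` of labels with wrong answers.

    wrong_pool is a hypothetical list of candidate wrong answers; it must
    contain at least one value different from every label it may replace.
    """
    rng = random.Random(seed)
    n_noisy = round(noise_ratio * len(labels))
    noisy_idx = set(rng.sample(range(len(labels)), n_noisy))
    corrupted = []
    for i, label in enumerate(labels):
        if i in noisy_idx:
            # Pick a replacement that differs from the original label.
            candidates = [w for w in wrong_pool if w != label]
            corrupted.append(rng.choice(candidates))
        else:
            corrupted.append(label)
    return corrupted
```

With 10 labels and `noise_ratio=0.9`, exactly 9 labels are replaced, leaving a single correct label in the set.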

Section 05

Conclusions and Practical Implications

The core contributions are: a systematic analysis and classification of noisy label mechanisms in RLVR; the discovery of the early correctness consistency phenomenon; the OLR method itself; and experimental validation of its effectiveness. The practical implications are that RLVR pipelines should assume label noise exists, that early intervention on noisy labels is more effective, and that self-supervision can improve training quality.

Section 06

Limitations and Future Research Directions

Current limitations are the dependence on verifiers (which limits open-domain applications), the added computational overhead, and an incomplete theoretical understanding. Future directions include extending the method to open-domain tasks, exploring multi-agent collaborative label refinement, developing adaptive correction thresholds, and strengthening the theoretical analysis to further optimize the method.
