DenoiseRL: Learning from Errors, a Bootstrapping Framework for Reasoning Models Without Strong Supervision

DenoiseRL is an innovative reinforcement learning framework that learns recovery strategies from the erroneous reasoning traces of weak models, eliminating reliance on strong teacher models and carefully curated datasets. It consistently outperforms existing baselines on mathematical and general reasoning benchmarks.

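Tags: DenoiseRL, reinforcement learning, reasoning models, bootstrapped training, error recovery, weakly supervised learning, mathematical reasoning, self-correction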
Published 2026-05-27 20:52 | Recent activity 2026-05-28 11:50 | Estimated read 10 min

Section 01

[Introduction] DenoiseRL: A Bootstrapping Framework for Reasoning Models Without Strong Supervision

DenoiseRL: An Innovative Framework Learning from Errors

DenoiseRL is a reinforcement learning framework that does not require strong supervision. Its core idea is to learn recovery strategies from the erroneous reasoning traces of weak models, which removes the dependence on strong teacher models and carefully curated datasets. The framework consistently outperforms existing baselines on mathematical and general reasoning benchmarks. The paper was posted to arXiv on May 27, 2026 (original link: http://arxiv.org/abs/2605.28421v1).

Section 02

Dilemmas in Improving Reasoning Ability and Limitations of Existing Methods

Dilemmas in Improving Reasoning Ability

Current training paradigms for improving the reasoning ability of large language models face a fundamental contradiction: training a stronger model requires a stronger teacher or a higher-quality dataset, which creates a chicken-and-egg problem. The main existing methods all rely on some form of strong supervision:

Method Type | Core Dependence | Main Limitations
Supervised Fine-tuning (SFT) | Correct reasoning trajectories generated by strong teachers | Capped by the teacher's ability
RLHF | Human-annotated preference data | High annotation cost; hard to cover complex reasoning
Process Reward Model (PRM) | Step-level correctness annotations | Requires extensive manual work or verification by a strong model
Curriculum Learning | Progressively harder datasets | High construction cost

Section 03

Core Ideas and Technical Implementation of DenoiseRL

Key Insights

  1. Error traces from weak models still contain partially correct steps and usable intermediate results
  2. Recovering from an error requires understanding the essence of the problem, which leads to deeper learning
  3. A noisy prefix is therefore a learning opportunity rather than wasted data

Three Stages of the Framework

  1. Generate noisy prefixes: Use the current weak model to generate reasoning traces with errors
  2. Recovery optimization: Train the model to identify errors, generate recovery strategies, and optimize recovery ability
  3. Iterative bootstrapping: As the model improves, it takes on more complex errors, forming a positive feedback loop
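
The three stages above can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the paper's implementation: `ToyModel`, `generate_noisy_prefix`, `recover`, `bootstrap`, and the skill-update rule are all hypothetical stand-ins, and the task is integer addition so the example runs with the standard library only.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for a reasoning model; `skill` is its chance of fixing an error."""
    skill: float = 0.3


def generate_noisy_prefix(numbers, rng):
    """Stage 1: build a partial running-sum trace that contains one arithmetic slip."""
    cut = rng.randrange(1, len(numbers))      # how many numbers the prefix covers
    trace, total = [], 0
    for n in numbers[:cut]:
        total += n
        trace.append(total)
    bad_step = rng.randrange(cut)
    offset = rng.choice([-2, -1, 1, 2])
    # The slip propagates to every later step of the prefix.
    trace = [t + offset if i >= bad_step else t for i, t in enumerate(trace)]
    return trace, cut


def recover(model, numbers, prefix, cut, rng):
    """Stage 2: continue from the noisy prefix, repairing it with probability `skill`."""
    if rng.random() < model.skill:
        running = sum(numbers[:cut])          # error spotted: recompute the prefix
    else:
        running = prefix[-1]                  # error missed: build on the bad value
    return running + sum(numbers[cut:])


def bootstrap(model, rounds=5, batch=200, lr=0.5, seed=0):
    """Stage 3: repeat generate -> recover -> update; returns the skill after each round."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        rewards = []
        for _ in range(batch):
            numbers = [rng.randint(1, 9) for _ in range(5)]
            prefix, cut = generate_noisy_prefix(numbers, rng)
            answer = recover(model, numbers, prefix, cut, rng)
            rewards.append(1.0 if answer == sum(numbers) else 0.0)
        mean_reward = sum(rewards) / len(rewards)
        # Toy update rule (an assumption): move skill toward 1 in proportion to reward.
        model.skill += lr * mean_reward * (1.0 - model.skill)
        history.append(model.skill)
    return history
```

Calling `bootstrap(ToyModel())` returns one skill value per round. The sketch keeps the error difficulty fixed; a fuller version would also make the injected errors harder as the model improves, as stage 3 describes.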

Reward and Training Strategy

  • Reward: A base term (the recovery reaches the correct answer), plus an efficiency term (fewer steps) and a diversity term (multiple recovery paths)
  • Training: Importance sampling (prioritize the most valuable errors), curriculum-based noise injection (gradually increasing difficulty), and multi-path exploration
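
One way the reward and training strategy above might be combined is sketched below. Everything here is an assumption made for illustration: the additive form of the reward, the weights `w_eff` and `w_div`, the linear noise schedule, and the `p * (1 - p)` rule for deciding which errors are "valuable" do not come from the paper.

```python
def recovery_reward(is_correct, num_steps, max_steps,
                    distinct_correct_paths, num_samples,
                    w_eff=0.2, w_div=0.1):
    """Base + efficiency + diversity reward; the weights are hypothetical."""
    if not is_correct:
        return 0.0                                   # no bonus without a successful recovery
    base = 1.0
    # Fewer steps give a higher efficiency term, clipped to [0, 1].
    efficiency = 1.0 - min(num_steps, max_steps) / max_steps
    # Share of sampled recoveries that are both correct and distinct.
    diversity = distinct_correct_paths / num_samples
    return base + w_eff * efficiency + w_div * diversity


def noise_level(step, total_steps, start=0.1, end=0.5):
    """Curriculum (assumed linear): the corrupted fraction of a prefix grows over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def sampling_weights(recovery_rates):
    """Prioritize errors the model recovers from about half the time (assumed criterion)."""
    raw = [p * (1.0 - p) for p in recovery_rates]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)           # fall back to uniform sampling
    return [r / total for r in raw]
```

Gating the efficiency and diversity terms on correctness keeps a short or unusual but wrong recovery from being rewarded. The `p * (1 - p)` weighting is just one plausible reading of "valuable": errors the model always or never recovers from carry little training signal.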

Comparison with Traditional RL

Feature | Traditional On-Policy RL | DenoiseRL
Training data source | Self-sampled | Weak model error traces
Learning signal | Final answer correctness | Recovery ability
External supervision dependence | Medium | Low
Data efficiency | Average | High (errors contain more information)
Scalability | Limited by own quality | Can be bootstrapped to improve

Section 04

Experimental Results: Performance on Mathematical and General Reasoning Benchmarks

Experimental Results

Mathematical Reasoning Benchmarks

On datasets such as MATH and GSM8K, DenoiseRL:

  • Consistently outperforms strong on-policy RL baselines
  • Widens its advantage as training difficulty increases
  • Shows stronger self-correction behavior

General Reasoning Benchmarks

Across logic, commonsense, and code reasoning tasks, DenoiseRL:

  • Maintains performance while significantly reducing dependence on external resources
  • Improves training efficiency, needing less compute to reach the same performance

Key Findings

  1. Recovering from errors is more effective than imitating correct answers
  2. The model can improve through bootstrapping, without external strong supervision
  3. Recovery ability can be transferred to new error types

Section 05

Technical Significance and Application Value of DenoiseRL

Technical Significance

Paradigm Insights

The traditional assumption is that improving reasoning requires stronger supervision signals. DenoiseRL's insight is that well-designed recovery learning can make weak supervision produce strong results, which suggests a new approach: making good use of imperfect data.

Applicable to Resource-Constrained Scenarios

  • Open-source model catch-up: Efficiently improve reasoning ability in resource-limited projects
  • Vertical domain adaptation: Bootstrapping training in professional fields without strong teachers
  • Continuous learning: Improve from actual errors after deployment

Connection to Self-Correction Ability

The recovery ability learned in training is effectively a self-correction ability: the model becomes better at identifying its own mistakes, correcting them, and staying resilient on difficult problems, much like the way human experts solve problems.

Section 06

Limitations and Future Research Directions

Limitations and Future Directions

Current Limitations

  1. Dependence on error quality: If the weak model's errors are too far from any sensible reasoning, recovery is difficult
  2. Computational overhead: Generating and filtering error traces requires additional resources
  3. Limited theoretical understanding: There is not yet a sufficient explanation of why learning from errors is more effective

Future Research

  1. Adaptive noise injection: Dynamically adjust error difficulty
  2. Multi-agent DenoiseRL: Models provide error traces for each other
  3. Theoretical analysis: Sample efficiency and generalization characteristics
  4. Combination with other techniques: Integrate with chain-of-thought reasoning and verifiers

Section 07

Conclusion: The Path of Intelligent Evolution by Learning from Errors

Conclusion

DenoiseRL represents a shift in approach, from "pursuing perfect data" to "making good use of imperfect data", and shows that errors are a valuable learning resource. Beyond its technical value, it suggests that part of the essence of intelligence lies in recovering from errors, much as human understanding grows through trial and error. In today's competitive reasoning model landscape, DenoiseRL offers a sustainable and scalable path to improvement and may become a standard component of next-generation training.