Zing Forum

Fail2Fix-RL: A Lightweight Reinforcement Learning Framework for Small Models to Learn Self-Correction from Failures

Fail2Fix-RL is a lightweight framework for training small models' reasoning capabilities. It enables models to learn self-checking and correction by replaying failed reasoning trajectories online and introducing a verifiable reward mechanism.

LLM, reasoning, RLVR, self-correction, GRPO, math reasoning, CIPO, small model
Published 2026-05-31 19:44 | Recent activity 2026-05-31 19:50 | Estimated read 5 min

Section 01

Introduction

Fail2Fix-RL is a lightweight framework for training small models' reasoning capabilities. It enables models to learn self-checking and correction by replaying failed reasoning trajectories online and introducing a verifiable reward mechanism.

Section 02

Original Authors and Source

  • Original Author/Maintainer: KangarooKi
  • Source Platform: GitHub
  • Original Project Title: Fail2Fix-RL: Learning to correct from failed reasoning rollouts
  • Original Link: https://github.com/KangarooKi/Fail2Fix-RL
  • Publication Date: 2026-05-31

Section 03

Why Do We Need Fail2Fix-RL?

Traditional Reinforcement Learning with Verifiable Rewards (RLVR) usually provides sparse binary feedback for mathematical reasoning tasks: a reasoning trajectory is scored as either entirely correct or entirely wrong. This signal is objective, but it wastes the rich information contained in failed attempts. A nearly correct derivation, a solution with a single arithmetic slip, and a completely irrelevant answer all receive the same reward under the binary scheme.

The core insight of Fail2Fix-RL is that the wrong solutions a model generates are themselves valuable training material. Instead of discarding failed reasoning trajectories, the framework feeds them back to the model and trains it to identify the errors, retain the correct parts, and repair the wrong ones.

Section 04

Core Method: Dual-Path Online Training

Each online RL step of Fail2Fix-RL includes two parallel training streams:

Section 05

Base Reasoning Stream (Base Rollouts)

The model receives the original problem, generates multiple reasoning trajectories (rollouts), and then scores them via a deterministic mathematical verifier. This process follows the group advantage estimation style of GRPO (Group Relative Policy Optimization).
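
As a rough illustration of group advantage estimation in the GRPO style, the sketch below normalizes each rollout's verifier reward against the other rollouts sampled for the same problem. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for this example, not details taken from the Fail2Fix-RL repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (rollouts of one problem).

    Hypothetical helper: `eps` only guards against division by zero when
    every rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one problem: the verifier accepted two and rejected two.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout failed carries no relative signal.
flat = group_relative_advantages([0.0, 0.0, 0.0])
```

Correct rollouts end up with positive advantages and wrong ones with negative advantages, while a group whose rollouts all share the same reward yields zero advantage everywhere.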

Section 06

Correction Training Stream (Correction Replay)

Candidate solutions, which may or may not be wrong, are selected from the trajectories generated by the current policy and used to construct correction prompts. The model is then trained to:

  1. Check: Identify potential issues in the solution
  2. Preserve: Retain the correct parts of the solution
  3. Repair: Fix the wrong parts

The corrected trajectories are also scored by the verifier, with risk-aware reward shaping applied on top: if the model turns an originally correct solution into a wrong one, it receives an additional penalty.
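
The risk-aware shaping described above could look roughly like the following. This is a minimal sketch under stated assumptions: the function name, the 0/1 base reward, and the penalty size of 0.5 are all hypothetical, since the source only says that breaking a correct solution incurs an additional penalty.

```python
def correction_reward(original_correct: bool, corrected_correct: bool,
                      break_penalty: float = 0.5) -> float:
    """Score one correction rollout.

    Hypothetical shaping: the verifier verdict gives the base reward, and
    an extra penalty applies only when a correct solution was made wrong.
    """
    reward = 1.0 if corrected_correct else 0.0
    if original_correct and not corrected_correct:
        reward -= break_penalty  # assumed penalty size, not from the source
    return reward


fixed = correction_reward(original_correct=False, corrected_correct=True)
kept = correction_reward(original_correct=True, corrected_correct=True)
still_wrong = correction_reward(original_correct=False, corrected_correct=False)
broken = correction_reward(original_correct=True, corrected_correct=False)
```

Under these assumptions, failing to repair a wrong solution and breaking a correct one are no longer scored the same, which discourages the model from rewriting solutions that were already right.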

Section 07

Online Correction Replay

The correction prompts are built from trajectories generated by the current policy itself, so when the model learns to correct, it faces the kinds of errors it actually makes. This makes self-correction training closer to real deployment scenarios than training on a static dataset.
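
To make the idea of building correction prompts from the policy's own rollouts more concrete, here is a small sketch. The `Rollout` type, the prompt wording, and the function name are illustrative assumptions; the actual template used by the project is not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    problem: str
    solution: str
    is_correct: bool  # verifier verdict, kept for scoring, not shown to the model


# Hypothetical prompt wording covering the check / preserve / repair steps.
CORRECTION_TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Candidate solution (may contain errors):\n{solution}\n\n"
    "Check the solution, keep the parts that are correct, "
    "and repair any mistakes. Then give the final answer."
)


def build_correction_prompts(rollouts):
    """Pair each correction prompt with the verdict of the rollout it came from."""
    return [
        (CORRECTION_TEMPLATE.format(problem=r.problem, solution=r.solution),
         r.is_correct)
        for r in rollouts
    ]


# Rollouts sampled from the current policy in this training step.
current_rollouts = [
    Rollout("What is 2 + 2?", "2 + 2 = 5, so the answer is 5.", False),
    Rollout("What is 2 + 2?", "2 + 2 = 4, so the answer is 4.", True),
]
prompts = build_correction_prompts(current_rollouts)
```

Keeping the original verdict next to each prompt lets a later reward step tell whether the candidate was correct before the model edited it.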

Section 08

Difficulty-Aware Selection

During training, problems whose rollouts include both successful and failed trajectories are prioritized as correction training material. Such problems usually sit at the boundary of the model's capability: they are neither too easy (always correct) nor too hard (always wrong), which makes them the most valuable learning samples.
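
One simple way to express this prioritization is to rank problems with mixed outcomes ahead of the rest. The sketch below assumes verifier verdicts are stored as a list of booleans per problem; the function name and data layout are hypothetical, and the project may use a different ranking rule.

```python
def prioritize_for_correction(outcomes):
    """Order problem ids so that mixed-outcome problems come first.

    `outcomes` maps a problem id to the verifier verdicts (True/False) of
    its rollouts. This layout is an assumption made for the example.
    """
    def is_mixed(verdicts):
        # At least one success and at least one failure.
        return any(verdicts) and not all(verdicts)

    # False sorts before True, so negate; sorted() is stable, which keeps
    # the input order among problems of the same kind.
    return sorted(outcomes, key=lambda pid: not is_mixed(outcomes[pid]))


outcomes = {
    "easy": [True, True, True],     # always correct
    "edge": [True, False, False],   # at the capability boundary
    "hard": [False, False, False],  # always wrong
}
ordered = prioritize_for_correction(outcomes)
```

Here the "edge" problem is ranked first because it is the only one with both a success and a failure among its rollouts.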