Zing Forum


ThinkTwice: Jointly Optimizing Reasoning and Self-Correction Capabilities of Large Language Models

ThinkTwice is a two-stage extended training method based on GRPO. In each training cycle it first trains the model to solve reasoning tasks and then trains it to correct its own answers, jointly optimizing reasoning and self-correction capabilities.

Tags: LLM · reasoning · self-refinement · GRPO · training · math
Published 2026-04-22 22:05 · Recent activity 2026-04-22 22:20 · Estimated read: 7 min

Section 01

[Introduction] ThinkTwice: A New Method for Jointly Optimizing LLM Reasoning and Self-Correction Capabilities

ThinkTwice, proposed by the CSSLab research team, is a two-stage extended training method built on Group Relative Policy Optimization (GRPO). In each training cycle it first trains the model to solve reasoning tasks and then trains it to correct its own answers, jointly optimizing reasoning and self-correction capabilities without relying on external feedback mechanisms, with the aim of enhancing the model's autonomous learning ability and reliability.


Section 02

Research Background and Challenges

Large language models have made significant progress on complex tasks such as mathematical reasoning and code generation, but they have two key limitations: their initial reasoning is error-prone, and they struggle to identify and correct their own mistakes. Existing methods often train reasoning and self-correction separately or rely on external feedback mechanisms, which increases system complexity and limits the model's autonomous learning ability. The ThinkTwice project aims to improve both capabilities within a single training framework, teaching the model to "think twice": first generate an answer, then actively correct it.


Section 03

Core Method: Two-Stage Joint Training

The core innovation of ThinkTwice is dividing each training cycle into two stages:

  1. Reasoning-task training: the model learns to solve math competition problems, logical reasoning problems, and similar tasks; as in standard RLHF-style training, the policy is optimized with rewards based on the correctness of the generated answers.
  2. Self-correction training: the model revises the answers produced in the first stage, again rewarded purely on answer correctness. No external evaluation model or manual annotation is needed, so the "check-and-correct" thinking pattern is internalized into a self-improvement loop.

Both stages use the same reward signal, which avoids the complexity of multi-objective optimization and ensures the two capabilities improve in tandem.
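The cycle above can be sketched in miniature. This is an illustrative toy, not the authors' code: `sample_answer`, `correct`, and `revise` are hypothetical stand-ins for the policy's two generation modes and the correctness check, and only the GRPO-style group-relative advantage computation and the shared reward are shown.

```python
# Toy sketch of one ThinkTwice training cycle (assumed structure, not the
# released implementation). GRPO scores each sampled answer relative to the
# mean reward of its group; the same correctness reward drives both stages.
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward minus group mean, scaled by std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def training_cycle(problem, sample_answer, correct, revise, group_size=4):
    # Stage 1: sample a group of answers and reward their correctness.
    answers = [sample_answer(problem) for _ in range(group_size)]
    stage1_rewards = [1.0 if correct(a) else 0.0 for a in answers]
    stage1_adv = grpo_advantages(stage1_rewards)

    # Stage 2: revise each first-pass answer; the reward signal is identical,
    # so no external evaluator is involved.
    revised = [revise(problem, a) for a in answers]
    stage2_rewards = [1.0 if correct(a) else 0.0 for a in revised]
    stage2_adv = grpo_advantages(stage2_rewards)
    return stage1_adv, stage2_adv
```

In a real run the advantages would feed the clipped policy-gradient update; here they only illustrate that both stages share one reward definition.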

Section 04

Technical Implementation and Experimental Setup

The project is implemented on top of the verl framework and supports open-source models such as Qwen3-4B-Instruct and OLMo-3-7B-Instruct; training scripts and weights are available on Hugging Face. Hardware requirements: at least 2 NVIDIA GPUs (the official tests used A100/H100). Software requirements: Linux, CUDA 12.x, and conda. Evaluation benchmarks include mathematical-reasoning datasets such as MATH500, AIME2024, and AMC. The training script is a single-command launcher that activates the conda environment, configures Ray distributed training, and manages hyperparameters with Hydra, lowering the barrier to reproduction.
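Since the reward in both stages is answer correctness on MATH-style benchmarks, a minimal correctness check might look like the following. This is an assumption about the reward design (MATH-style datasets conventionally mark the final answer with `\boxed{...}`); the released scripts define their own, likely more robust, matcher.

```python
# Minimal sketch of a correctness reward for MATH-style completions
# (an assumption for illustration, not the project's actual reward code).
import re

def extract_boxed(text):
    """Return the content of the last \\boxed{...} in a completion, or None.

    Note: this simple regex does not handle nested braces inside the box.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def correctness_reward(completion, gold):
    """1.0 if the boxed final answer string-matches the gold answer, else 0.0."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0
```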


Section 05

Evaluation Methods and Experimental Results

ThinkTwice uses multi-dimensional evaluation:

  • Pass@k evaluation: generate multiple samples and compute pass rates at different values of k, comparing the original answers against the corrected ones;
  • Cross-model correction evaluation: test how much the model improves answers generated by other models, verifying that the correction capability transfers.

Experimental results show that the trained model can effectively identify its own errors and that the corrected answers are of significantly higher quality, which is valuable for high-reliability scenarios such as educational tutoring and research assistance.

Section 06

Application Value and Insights

Insights from the ThinkTwice methodology:

  1. Training efficiency: Joint optimization avoids resource waste from training reasoning and correction models separately;
  2. Autonomous capability: Self-correction capability does not rely on external systems, reducing deployment complexity;
  3. Interpretability: The two-stage training process is clear, making it easy to analyze the behavioral differences between the model's reasoning and correction stages;
  4. Generalization potential: Can be extended to task domains requiring self-verification, such as code generation, text summarization, and question-answering systems.

Section 07

Quick Start and Usage Guide

The project repository provides detailed documentation and example scripts. Steps: prepare evaluation datasets → download base model weights → run the training scripts. Developers can also download the pre-trained models directly from Hugging Face for inference testing. ThinkTwice offers a new approach to improving LLM reliability and may help make self-correction a standard capability of the next generation of LLMs.
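For inference testing, the "think twice" pattern amounts to a two-turn interaction: solve, then revise. The prompt templates below are hypothetical illustrations of that pattern; the released checkpoints may expect different wording, so check the repository's documentation for the exact templates.

```python
# Hypothetical two-turn prompts for inference-time self-correction
# (illustrative only; the actual templates ship with the project).
def solve_prompt(problem):
    """Turn 1: ask the model for a first-pass solution."""
    return f"Solve the following problem step by step.\n\nProblem: {problem}"

def revise_prompt(problem, first_answer):
    """Turn 2: feed the first answer back and ask for a corrected one."""
    return (
        f"Problem: {problem}\n\n"
        f"Your previous answer:\n{first_answer}\n\n"
        "Check the reasoning above for mistakes and give a corrected final answer."
    )
```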