Section 01
[Introduction] ThinkTwice: A New Method for Jointly Optimizing LLM Reasoning and Self-Correction Capabilities
ThinkTwice is a two-stage training method proposed by the CSSLab research team that builds on Group Relative Policy Optimization (GRPO). In each training cycle, the model is first trained to solve reasoning tasks and then trained to correct its own answers. This jointly optimizes reasoning and self-correction without relying on external feedback mechanisms, with the aim of improving the model's autonomous learning ability and reliability.
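The two-phase cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the CSSLab implementation: the names `think_twice_cycle`, `sample`, `reward`, and `update`, the prompt wording, the group size, and the choice of which draft to revise are all hypothetical. The only part taken from GRPO as it is commonly described is the group-relative advantage, which standardizes each reward against the other samples in its group.

```python
import statistics
from typing import Callable, List, Sequence, Tuple


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def think_twice_cycle(
    problem: str,
    sample: Callable[[str], str],                  # hypothetical: draws one model output for a prompt
    reward: Callable[[str, str], float],           # hypothetical: scores an answer to the problem
    update: Callable[[List[str], List[float]], None],  # hypothetical: one policy-gradient step
    group_size: int = 4,
) -> Tuple[List[float], List[float]]:
    """One training cycle: a solve phase followed by a self-correction phase."""
    # Phase 1: sample a group of solutions and update on their relative advantages.
    solutions = [sample(f"Solve: {problem}") for _ in range(group_size)]
    solve_adv = group_relative_advantages([reward(problem, s) for s in solutions])
    update(solutions, solve_adv)

    # Phase 2: show the model one of its own drafts and train it to revise it.
    # Assumption: the prompt does not say whether the draft was right or wrong.
    draft = solutions[0]
    refine_prompt = (
        f"Problem: {problem}\nPrevious answer: {draft}\n"
        "Review the answer and correct it if needed."
    )
    revisions = [sample(refine_prompt) for _ in range(group_size)]
    refine_adv = group_relative_advantages([reward(problem, r) for r in revisions])
    update(revisions, refine_adv)

    return solve_adv, refine_adv
```

In this sketch both phases share the same advantage computation and the same policy, which is one simple way to read "joint optimization"; when every sample in a group receives the same reward, the advantages are all zero and that group contributes no learning signal.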