Section 01
【Introduction】STV: Self-Trained Verifier Unlocks a New Path for Self-Improvement of Reasoning Models
STV (Self-Trained Verifier) addresses a key bottleneck in the self-improvement of reasoning models: a model is poor at spotting errors in its own output. STV uses reference answers to train a verifier that identifies these self-generated errors. The method yields significant gains in two settings: Verification-Refinement (V-R) loops at test time, and Verifier-in-the-Loop (ViL) training at training time. Its core idea is to exploit an asymmetry: a model can judge errors accurately when a reference answer is available, but struggles to do so without one. STV distills this reference-informed verification ability into a reference-free verifier.
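The distillation idea can be illustrated with a minimal Python sketch of how training data for a reference-free verifier might be assembled. This is not the authors' implementation. The `Sample` class, the `Answer:` output format, and the exact-match `informed_judge` are all hypothetical stand-ins; in practice the informed judgment would presumably come from the model itself, prompted with the reference answer.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    solution: str   # self-generated reasoning ending in "Answer: <value>" (assumed format)
    reference: str  # reference answer, used only while building the dataset


def final_answer(solution: str) -> str:
    """Return the text after the last 'Answer:' marker (hypothetical format)."""
    return solution.rsplit("Answer:", 1)[-1].strip()


def informed_judge(sample: Sample) -> bool:
    """Stand-in for reference-informed verification: exact match on the final answer."""
    return final_answer(sample.solution) == sample.reference.strip()


def build_verifier_dataset(samples):
    """Label each solution using its reference, then omit the reference from the input."""
    dataset = []
    for s in samples:
        label = "correct" if informed_judge(s) else "incorrect"
        dataset.append({
            # The verifier sees only the question and solution, never the reference.
            "input": f"Question: {s.question}\nSolution: {s.solution}",
            "label": label,
        })
    return dataset


samples = [
    Sample("What is 7 * 8?", "7 * 8 = 56. Answer: 56", "56"),
    Sample("What is 7 * 8?", "7 * 8 = 54. Answer: 54", "56"),
]
dataset = build_verifier_dataset(samples)
```

The key design point in this sketch is that the reference answer influences only the label, not the input, so a verifier trained on `dataset` has to learn to judge solutions without a reference.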