Zing Forum

STV: Self-Trained Verifier Unlocks Self-Improvement Capabilities of Reasoning Models

STV uses reference answers to train a verifier that identifies self-generated errors. It delivers significant gains in both test-time Verification-Refinement (V-R) loops and training-time Verifier-in-the-Loop (ViL) training, opening a new path for the self-improvement of reasoning models.

Tags: Self-Trained Verifier, Verification-Refinement Loop, Verifier-in-the-Loop, Reasoning Models, Self-Improvement, Reinforcement Learning
Published 2026-05-29 01:40 | Recent activity 2026-05-29 14:27 | Estimated read 7 min

【Introduction】STV: Self-Trained Verifier Unlocks a New Path for Self-Improvement of Reasoning Models

STV (Self-Trained Verifier) addresses a central bottleneck in the self-improvement of reasoning models: it uses reference answers to train a verifier that identifies the model's own errors. The method delivers significant gains in both test-time Verification-Refinement (V-R) loops and training-time Verifier-in-the-Loop (ViL) training. Its core idea is an asymmetry: a model can judge its own answers accurately when a reference answer is available, but struggles to do so without one. STV distills that reference-informed judgment into a verifier that needs no reference.

【Background】Dual Dilemmas and Core Bottlenecks in Self-Improvement of Reasoning Models

Reasoning models hit bottlenecks in two key self-improvement scenarios:

  1. Test-time: V-R loops easily stall because verifier scores are inflated and feedback is vague;
  2. Training-time: self-training on incorrect data degrades performance.

Both problems share one root cause: verifier quality. There is no training signal for catching self-generated errors, yet catching those errors is exactly the capability that needs to be trained.
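
The test-time failure mode described above can be illustrated with a toy V-R loop. This is a minimal sketch, not the article's implementation: the `verify` and `refine` interfaces, the acceptance threshold, and the toy verifiers are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (not from the source): a verifier returns a score in
# [0, 1] plus free-text feedback; a refiner rewrites an answer using feedback.
Verifier = Callable[[str, str], Tuple[float, str]]
Refiner = Callable[[str, str, str], str]


def vr_loop(problem: str, answer: str, verify: Verifier, refine: Refiner,
            accept_at: float = 0.9, max_rounds: int = 4) -> Tuple[str, int]:
    """Alternate verification and refinement until the verifier accepts."""
    for round_idx in range(max_rounds):
        score, feedback = verify(problem, answer)
        if score >= accept_at:
            return answer, round_idx  # accepted after round_idx refinements
        answer = refine(problem, answer, feedback)
    return answer, max_rounds


def inflated_verifier(problem: str, answer: str) -> Tuple[float, str]:
    # Toy failure mode: always scores high, so the loop never refines.
    return 0.95, "looks fine"


def strict_verifier(problem: str, answer: str) -> Tuple[float, str]:
    # Toy stand-in for a calibrated verifier on the problem "2 + 2".
    if answer.strip() == "4":
        return 1.0, "correct"
    return 0.1, "the sum is wrong; recompute 2 + 2"


def toy_refiner(problem: str, answer: str, feedback: str) -> str:
    # Toy refiner: only fixes the answer when the feedback is actionable.
    return "4" if "recompute" in feedback else answer


stuck = vr_loop("2 + 2", "5", inflated_verifier, toy_refiner)
fixed = vr_loop("2 + 2", "5", strict_verifier, toy_refiner)
```

With the inflated verifier the wrong answer is accepted immediately and no refinement happens; with the stricter toy verifier a single refinement round corrects it.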

【Methodology】Core Insights and Implementation Mechanisms of STV

Core Insights

Models can accurately judge the correctness of self-generated answers when reference answers are available, but tend to overestimate quality without references. STV leverages this asymmetry as a supervision signal.

Training Process

  1. Generate candidate answers;
  2. Obtain reference answers;
  3. Use reference-informed judgments as supervision targets;
  4. Train the verifier to reproduce those judgments without access to the references.
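
The data-construction part of this process (steps 1 to 3) might look roughly like the sketch below. It is written under assumptions: `generate`, `judge_with_reference`, and `VerifierExample` are hypothetical names, and the source does not specify the label format or the training objective for step 4.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VerifierExample:
    problem: str
    candidate: str  # self-generated answer; this is what the verifier sees
    label: bool     # target taken from the reference-informed judgment


def build_stv_dataset(
    problems: List[str],
    references: List[str],
    generate: Callable[[str], List[str]],
    judge_with_reference: Callable[[str, str, str], bool],
) -> List[VerifierExample]:
    """Sample candidates, judge them against references, keep the labels.

    The reference is used only to produce the label. It is deliberately left
    out of the stored example so the verifier learns to judge without it.
    """
    dataset = []
    for problem, reference in zip(problems, references):
        for candidate in generate(problem):
            label = judge_with_reference(problem, candidate, reference)
            dataset.append(VerifierExample(problem, candidate, label))
    return dataset


# Toy stand-ins (assumptions for illustration only).
def toy_generate(problem: str) -> List[str]:
    return ["4", "5"]


def toy_judge(problem: str, candidate: str, reference: str) -> bool:
    return candidate.strip() == reference.strip()


examples = build_stv_dataset(["2 + 2"], ["4"], toy_generate, toy_judge)
```

The key design point is that each stored example carries the problem, the candidate, and the label, but never the reference itself, so a verifier trained on it has to learn reference-free judgment.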

Key Techniques

STV distills reference-based verification capability into a reference-free verifier. The approach is compatible with verifier designs such as outcome (result) verifiers, process verifiers, and critique models.

【Evidence】Significant Effects of STV in Test and Training Phases

Test-time Effects

  • Compared with baselines such as SFT, RL on verifier scores, and meta-verifiers, STV shows significant gains on difficult tasks;
  • Accuracy on hard math problems doubles, and accuracy on scientific reasoning tasks rises from 1.5% to 21% (a 14-fold improvement).
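
As a quick check of the scientific-reasoning figure, the short sketch below separates the relative gain (the fold improvement) from the absolute gain in percentage points. The helper names are hypothetical; only the 1.5% and 21% values come from the text.

```python
def fold_improvement(before_pct: float, after_pct: float) -> float:
    """Ratio of the new accuracy to the old one (relative gain)."""
    return after_pct / before_pct


def point_gain(before_pct: float, after_pct: float) -> float:
    """Difference in percentage points (absolute gain)."""
    return after_pct - before_pct


fold = fold_improvement(1.5, 21.0)   # 14.0, matching the "14-fold" claim
points = point_gain(1.5, 21.0)       # 19.5 percentage points
```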

Training-time Effects (ViL Training)

  • Starting from the point where standard RL has converged, ViL training improves pass@1 by a further 33%;
  • After training, the generator's standalone pass@1 (with no verifier at inference) remains 30% higher than with standard RL, suggesting that it has internalized the reasoning strategies.

【Conclusion】Deep Insights and Methodological Advantages of STV

Deep Insights

Verifiers can serve as effective teachers for generators: Standard RL reward signals are sparse and delayed, while ViL provides process-level, actionable feedback and high-quality data filtering, enabling adaptive curriculum learning.
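
One way to picture the data-filtering and feedback roles mentioned above is the sketch below. It is an illustration under assumptions, not the ViL algorithm itself: the source does not describe the training loop, so the `verify` interface, the `keep_at` threshold, and the toy verifier are all hypothetical.

```python
from typing import Callable, List, Tuple

Verdict = Tuple[float, str]  # (score in [0, 1], critique text); hypothetical


def split_rollouts(
    problem: str,
    rollouts: List[str],
    verify: Callable[[str, str], Verdict],
    keep_at: float = 0.5,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Partition rollouts into kept samples and (rollout, critique) pairs."""
    kept: List[str] = []
    to_revise: List[Tuple[str, str]] = []
    for rollout in rollouts:
        score, critique = verify(problem, rollout)
        if score >= keep_at:
            kept.append(rollout)  # passes the filter, usable as training data
        else:
            to_revise.append((rollout, critique))  # critique is the feedback
    return kept, to_revise


def toy_verify(problem: str, rollout: str) -> Verdict:
    # Toy verifier for "2 + 2": checks only the final answer.
    if rollout.endswith("= 4"):
        return 1.0, "final answer checks out"
    return 0.0, "final step is wrong: 2 + 2 is not " + rollout.split("= ")[-1]


kept, to_revise = split_rollouts(
    "2 + 2", ["2 + 2 = 4", "2 + 2 = 5"], toy_verify
)
```

Rollouts that pass the filter could feed the next training step, while rejected ones come with a critique that says what went wrong, which is a denser signal than a single end-of-episode reward.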

Methodological Advantages

  • Data efficiency: No additional manual annotation required;
  • Versatility: Compatible with any generator/verifier architecture;
  • Stackable effects: Further improvement on top of standard RL;
  • Interpretability: Feedback includes specific error analysis.

【Outlook】Limitations of STV and Future Research Directions

Limitations

  • Dependence on high-quality reference answers;
  • Verifier and generator capabilities need to be well matched;
  • High computational cost of ViL training.

Future Directions

  • Iterative STV (mutual improvement between generator and verifier);
  • Transfer of multi-task verification capabilities;
  • Integration with process reward models and Monte Carlo Tree Search;
  • Theoretical analysis of the relationship between verifier quality and generator improvement.

【Summary】Significance of STV for Self-Improvement of Reasoning Models

By exploiting the asymmetry between judging with and without reference answers, STV unlocks self-improvement for reasoning models at both test time and training time. The "internalization effect" of ViL training recasts the verifier's role: it moves from an auxiliary component to a core driver of training. The method offers a feasible path toward continuously self-improving AI systems and is a reminder that verification and generation capabilities complement each other.