Section 01
[Introduction] Probing the Self-Verification Capability of Reasoning Models: Identifying Answer Correctness via Hidden States
This study predicts answer correctness by probing the hidden states of reasoning models with lightweight trained classifiers, offering a new approach to improving the reliability and self-correction capabilities of reasoning models. Key findings: hidden states encode a correctness signal, and the trained detectors generalize well across models. In practice, such detectors can attach a credibility score to each answer, supporting deployment in high-risk scenarios.
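The probing setup described above can be illustrated with a minimal sketch: train a lightweight logistic-regression probe that maps a hidden-state vector to a correctness probability. The data here is synthetic (a random "correctness direction" planted in the vectors), and all dimensions and hyperparameters are hypothetical, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64    # hidden-state dimensionality (hypothetical)
n = 2000  # number of (hidden state, correctness label) pairs

# Synthetic stand-in for hidden states: a linear "correctness direction"
# w_true is planted in the vectors, mimicking the finding that hidden
# states carry a correctness signal.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Lightweight probe: logistic regression trained by gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid -> P(answer correct)
    w -= lr * (X.T @ (p - y) / n)           # gradient of log-loss w.r.t. w
    b -= lr * (p - y).mean()                # gradient of log-loss w.r.t. b

# The probe's sigmoid output doubles as a per-answer credibility score.
acc = (((X @ w + b) > 0) == (y > 0.5)).mean()
print(f"probe accuracy: {acc:.2f}")
```

Because the probe is a single linear layer, it adds negligible inference cost on top of the base model; in the real setting, `X` would hold hidden states extracted from a reasoning model and `y` would mark whether each generated answer was correct.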