Zing Forum

Probing the Self-Verification Capability of Reasoning Models: Identifying Answer Correctness via Hidden States

This study predicts the correctness of model answers by probing the hidden states of reasoning models, offering new insights for improving the reliability and self-correction capabilities of these models.

Tags: reasoning models, self-verification, hidden-state probing, chain-of-thought, model interpretability, answer correctness prediction
Published 2026-05-14 14:45 · Recent activity 2026-05-14 14:48 · Estimated read 6 min

Section 01

[Introduction] Probing the Self-Verification Capability of Reasoning Models: Identifying Answer Correctness via Hidden States

This study predicts answer correctness by probing the hidden states of reasoning models and training lightweight classification detectors on them, offering a new route to more reliable, self-correcting reasoning models. Key findings: hidden states carry correctness signals, and the detectors generalize well across models. In practice, the detectors can attach credibility scores to model answers, making the models easier to deploy in high-risk scenarios.
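
A minimal sketch of the core recipe, with random stand-in data (the 4096-dimensional features, the logistic-regression probe, and all names here are illustrative assumptions, not the paper's exact pipeline):

```python
# Train a lightweight detector that maps a hidden state to a
# correctness label. Random stand-in data; in the real pipeline each
# row would be the last-layer hidden state of one reasoning segment.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096)).astype(np.float32)  # hidden_size=4096 is typical for 7B models
y = rng.integers(0, 2, size=1000)                     # 1 = intermediate answer was correct

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000)  # a linear probe keeps the detector lightweight
probe.fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.3f}")
```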

Section 02

Research Background: Reliability Challenges of Reasoning Models

With the rise of reasoning models such as DeepSeek-R1, large language models perform well on complex tasks like mathematical reasoning and code generation, but they suffer from "confidently making mistakes": even when the reasoning goes wrong, they still produce plausible-looking answers, which restricts their use in high-risk scenarios. Developing self-verification mechanisms has therefore become a key direction for making these models practical.

Section 03

Core Method: Technical Route for Hidden State Probing

The study designs a complete probing pipeline (a sketch of steps 1 and 3 follows this list):
1. Chain-of-thought generation and segmentation: the model generates a reasoning chain, which is split into logical paragraphs;
2. Intermediate answer extraction and annotation: an external tool such as the Gemini API extracts each paragraph's intermediate answer and annotates its correctness;
3. Hidden state extraction: the last-layer hidden state of each paragraph is collected;
4. Detector training: a binary classifier is trained on the hidden states and labels, with hyperparameters tuned via grid search.
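
A sketch of steps 1 and 3, assuming paragraph segmentation on blank lines and last-token pooling (both are assumptions; the paper's exact splitting and pooling rules may differ):

```python
# Segment a chain of thought into paragraphs (step 1) and collect the
# last-layer hidden state at the end of each paragraph (step 3).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # one model family from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

cot = "Step 1: rewrite the equation ...\n\nStep 2: solve for x ...\n\nSo the answer is 42."
segments = [s for s in cot.split("\n\n") if s.strip()]  # step 1: blank-line split (assumption)

features, prefix = [], ""
for seg in segments:
    prefix += seg + "\n\n"
    inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # last layer, final token of the current prefix
    features.append(out.hidden_states[-1][0, -1].float().cpu())

X = torch.stack(features)  # one feature vector per reasoning paragraph
print(X.shape)             # (num_segments, hidden_size); these feed the detector in step 4
```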

Section 04

Experimental Results: Cross-Model Generalization and Practical Application Value

The experiments verify three points (point 3 is illustrated in the sketch after this list):
1. Cross-model generalization: a detector trained on one model transfers to others, suggesting the models share similar internal representation patterns;
2. Best performance on the MATH dataset: mathematical reasoning tasks may trigger the self-verification mechanism more readily, or carry structural regularities that make correctness easier to judge;
3. Application value: credibility scores can be attached without increasing reasoning cost, and a predicted error can trigger strategies such as re-reasoning or manual review.
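
One way point 3 could look at deployment time, reusing a trained probe such as the one sketched in Section 01 (the 0.9/0.5 thresholds and the routing actions are hypothetical choices, not values from the paper):

```python
import numpy as np

def route_answer(probe, hidden_state: np.ndarray, answer: str) -> dict:
    """Attach a credibility score to an answer and pick a follow-up action."""
    score = float(probe.predict_proba(hidden_state.reshape(1, -1))[0, 1])  # P(answer correct)
    if score >= 0.9:
        action = "accept"        # high credibility: return the answer as-is
    elif score >= 0.5:
        action = "re-reason"     # borderline: trigger another reasoning pass
    else:
        action = "human-review"  # predicted error: escalate to manual review
    return {"answer": answer, "credibility": score, "action": action}
```

Because the score is computed from hidden states the model already produced, it adds essentially no reasoning cost; only the fallback actions consume extra compute.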

Section 05

Technical Implementation: Open-Source Code and Pre-Trained Resources

The research team has open-sourced the full pipeline code (data preprocessing, training, evaluation), with support for mainstream models such as DeepSeek-R1-Distill-Qwen. Pre-trained detectors are provided for multiple model-dataset combinations, and detectors trained on MATH data generalize best. The codebase is modular, so models, datasets, and metrics can be swapped flexibly for customized experiments.
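
As a generic sketch of that modular pattern (this is not the repository's actual interface; every name below is hypothetical):

```python
# Pluggable experiment configs: swap the model, dataset, or metric
# without touching the rest of the pipeline. NOT the repo's real API.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ProbeConfig:
    model_name: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
    dataset: str = "MATH"     # MATH-trained detectors generalized best
    layer: int = -1           # probe the last layer
    metric: str = "accuracy"

base = ProbeConfig()
variants = [
    base,
    replace(base, dataset="GSM8K"),   # hypothetical dataset swap
    replace(base, metric="roc_auc"),  # hypothetical metric swap
]
for cfg in variants:
    print(cfg)  # each config would drive one preprocess/train/evaluate run
```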

Section 06

Research Implications: Self-Verification and the Development of Reasoning Models

Three implications follow from this study (a layer-probing sketch for point 2 appears after this list):
1. Self-verification is a key component of reasoning ability, and future models should integrate the mechanism explicitly;
2. Hidden-state probing offers a new lens on model interpretability and can reveal where reasoning decisions are made;
3. Reliable self-verification enables human-machine collaboration, letting users focus their review effort on low-credibility cases.
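
As an illustration of point 2, a simple layer sweep trains one probe per layer and reports where the correctness signal emerges (the layer count and random stand-in data below are illustrative):

```python
# Probe every layer's hidden states and compare accuracies; the layer
# where accuracy jumps hints at a reasoning decision node.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
num_layers, num_segments, hidden_size = 8, 500, 256  # small stand-in sizes
hidden = rng.normal(size=(num_layers, num_segments, hidden_size)).astype(np.float32)
labels = rng.integers(0, 2, size=num_segments)

for layer in range(num_layers):
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, hidden[layer], labels, cv=3).mean()
    print(f"layer {layer}: probe accuracy {acc:.3f}")
```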

Section 07

Limitations and Future Research Directions

Current limitations: relying on external tools to extract and annotate intermediate answers may introduce labeling errors, and performance varies across problem types (accuracy on multi-hop and commonsense reasoning still needs improvement). Future directions: develop end-to-end self-verification training objectives, explore fine-grained error-localization mechanisms, and combine techniques such as active learning and continual learning.