Yale NLP Proposes a New Framework for Quantifying the Faithfulness of Confidence Expressions in Reasoning Models

Yale University's NLP Lab has open-sourced the faithful_lrm project, proposing a systematic framework to evaluate whether the confidence expressions of Large Reasoning Models (LRMs) in chain-of-thought reflect their internal uncertainty truthfully, and revealing key challenges in confidence calibration for current reasoning models.

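Large reasoning models, confidence calibration, chain-of-thought, uncertainty quantification, AI interpretability, model evaluation, Yale University, open-source tools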
Published 2026-06-04 01:26 | Recent activity 2026-06-04 01:50 | Estimated read 7 min

Section 01

Yale NLP Open-Sources faithful_lrm Framework, Focusing on Evaluating the Faithfulness of Confidence Expressions in Large Reasoning Models

The faithful_lrm project asks a narrow question: when a Large Reasoning Model (LRM) voices confidence in its chain-of-thought, does that wording track the model's internal uncertainty? The accompanying experiments reveal key challenges in confidence calibration for current reasoning models. The framework aims to enhance the reliability and safety of AI systems.

Section 02

Research Background and Motivation

Large reasoning models (e.g., DeepSeek-R1, QwQ) often express linguistic confidence (such as "I am very confident") when solving complex tasks via chain-of-thought, but a core question has been overlooked: do these expressions truthfully reflect the model's internal uncertainty? The faithfulness of confidence expressions is crucial for AI reliability. Overconfident wording can lead users to trust wrong answers, while excessive hedging makes correct answers less useful in practice.

Section 03

Core Methodology

The framework estimates a model's internal confidence along three dimensions:

  1. Representation-based Confidence: Analyze the activation patterns of the model's hidden layers and extract internal uncertainty using the DeepConf metric;
  2. Token Probability-based Confidence: Use token log probabilities and aggregate chain-of-thought probability information via the RCC metric;
  3. Sampling Consistency-based Confidence: Sample continuations multiple times and measure confidence by how consistent the outputs are.

In addition, Gemini-2.5-Flash scores the linguistic decisiveness of each reasoning trajectory, and that score is compared against internal confidence to calculate the "faithfulness gap".
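As a rough illustration of the probability-based and sampling-based estimators and of the faithfulness gap, here is a minimal Python sketch. It is not the project's code: the article does not give the DeepConf or RCC formulas, so a plain geometric-mean token probability stands in for the probability-based estimator. The function names, the assumption that decisiveness is scaled to [0, 1], and the definition of the gap as a simple signed difference are all hypothetical.

```python
import math
from collections import Counter


def token_prob_confidence(token_logprobs):
    """Geometric-mean token probability of one trajectory, in (0, 1].

    Stand-in for a probability-based estimator; NOT the RCC metric.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def sampling_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


def faithfulness_gap(decisiveness, internal_confidence):
    """Signed gap (assumed definition): positive means the wording sounds
    more certain than the internal signal supports."""
    return decisiveness - internal_confidence


# Example: four sampled continuations, three of which agree.
majority, agreement = sampling_consistency(["42", "42", "17", "42"])
gap = faithfulness_gap(decisiveness=0.9, internal_confidence=agreement)
```

In this example the agreement is 0.75, so a trajectory whose wording was scored 0.9 for decisiveness would show a positive gap, meaning it sounds more certain than the sampling signal suggests.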

Section 04

Experimental Design and Datasets

The experiments cover multiple reasoning-intensive benchmarks: AIME (mathematical reasoning), HLE (comprehensive reasoning), SuperGPQA (scientific QA), LegalBench (legal reasoning), and MuSR (multi-step reasoning). The tested models include the DeepSeek-R1-Distill series and Qwen/QwQ series, with parameter sizes ranging from 7B to 32B.

Section 05

Key Findings

The study reports four key findings:

  1. Reasoning ability ≠ Confidence calibration: There is no necessary connection between a model's reasoning performance and the faithfulness of its confidence expressions; training objectives focus on correctness rather than calibration;
  2. Limited effect of prompt interventions: Strategies like perceptual language and metacognitive hedging prompts cannot reliably fix calibration issues;
  3. Significant divergence among confidence estimators: The three internal estimators (representation, probability, sampling) show large differences in evaluation results for the same trajectory;
  4. High-confidence errors are common: Models often exhibit high linguistic confidence even when giving wrong answers, which risks misleading users.
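Findings 3 and 4 can each be turned into a simple statistic. The sketch below is an illustrative assumption, not the study's actual analysis: `estimator_disagreement` averages the absolute difference over all estimator pairs for one trajectory, and `high_confidence_error_rate` gives the share of highly decisive trajectories that end in a wrong answer. The record layout, the key names, and the 0.8 threshold are hypothetical.

```python
from itertools import combinations


def estimator_disagreement(scores):
    """Mean absolute difference across all estimator pairs for one trajectory.

    `scores` maps an estimator name to a confidence in [0, 1].
    """
    if len(scores) < 2:
        raise ValueError("need at least two estimators to compare")
    pairs = list(combinations(scores.values(), 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def high_confidence_error_rate(records, threshold=0.8):
    """Share of trajectories with decisiveness >= threshold that are wrong."""
    confident = [r for r in records if r["decisiveness"] >= threshold]
    if not confident:
        return 0.0
    return sum(not r["correct"] for r in confident) / len(confident)


# Example with made-up numbers, for illustration only.
spread = estimator_disagreement(
    {"representation": 0.9, "probability": 0.6, "sampling": 0.3}
)
rate = high_confidence_error_rate([
    {"decisiveness": 0.90, "correct": True},
    {"decisiveness": 0.95, "correct": False},
    {"decisiveness": 0.85, "correct": False},
    {"decisiveness": 0.40, "correct": False},
])
```

With these made-up inputs, three trajectories clear the threshold and two of them are wrong, so the high-confidence error rate is two thirds.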

Section 06

Technical Implementation and Open-Source Contributions

The project open-sources a complete experimental framework:

  • Experiment Generation Module: GPU inference pipeline (vLLM/HuggingFace), decisiveness scoring scripts, implementations of the three confidence estimators, dataset loaders;
  • Analysis Module: Visualization scripts (scatter plots, heatmaps, etc.), clustering/binning analysis, interactive HTML dashboard generation.
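The article does not say how the binning analysis is implemented. A common way to do this kind of analysis is a reliability table: group trajectories into equal-width confidence bins, compare mean confidence with accuracy in each bin, and summarize the mismatch as an expected calibration error (ECE). The sketch below assumes confidences in [0, 1] and uses hypothetical function names; it is not taken from the repository.

```python
def bin_by_confidence(records, n_bins=5):
    """Group (confidence, correct) pairs into equal-width bins over [0, 1].

    Returns one row per non-empty bin with its size, mean confidence,
    and accuracy.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        rows.append({
            "bin": i,
            "n": len(items),
            "mean_conf": sum(c for c, _ in items) / len(items),
            "accuracy": sum(1 for _, ok in items if ok) / len(items),
        })
    return rows


def expected_calibration_error(rows):
    """Size-weighted average of |mean confidence - accuracy| over the bins."""
    total = sum(r["n"] for r in rows)
    return sum(r["n"] / total * abs(r["mean_conf"] - r["accuracy"]) for r in rows)


# Example with made-up data: (confidence, answer was correct).
rows = bin_by_confidence([(0.95, True), (0.9, False), (0.3, False), (0.35, True)])
ece = expected_calibration_error(rows)
```

The same rows can feed a scatter plot or heatmap of confidence against accuracy, which is the kind of visualization the analysis module is described as producing.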

Section 07

Practical Implications and Recommendations

Recommendations for developers:

  1. Multi-dimensional monitoring: Combine representation, probability, and other metrics instead of relying on linguistic confidence alone;
  2. Calibration training: Add explicit calibration objectives during training instead of only optimizing accuracy;
  3. Human-machine collaboration: Trigger manual review when confidence signals are inconsistent in critical scenarios.

For researchers: The framework provides a benchmark tool for evaluating the reliability of reasoning models and promotes the development of more faithful and transparent AI systems.
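The human-machine collaboration recommendation for developers could be wired into a pipeline as a simple routing rule. The sketch below is one possible design under stated assumptions: all signals are scaled to [0, 1], "inconsistent" means the spread between the highest and lowest signal exceeds a threshold, and the name `needs_manual_review` and the 0.3 default are hypothetical choices that would need tuning per application.

```python
def needs_manual_review(signals, max_spread=0.3):
    """Flag a response for human review when confidence signals disagree.

    `signals` maps a signal name (for example verbal, representation,
    probability, sampling) to a confidence in [0, 1].
    """
    if len(signals) < 2:
        raise ValueError("need at least two signals to compare")
    values = list(signals.values())
    return max(values) - min(values) > max_spread


# Confident wording with weak internal signals is routed to a reviewer.
flagged = needs_manual_review(
    {"verbal": 0.95, "representation": 0.5, "probability": 0.6}
)
# Signals that roughly agree pass through without review.
passed = needs_manual_review(
    {"verbal": 0.8, "probability": 0.7, "sampling": 0.75}
)
```

A spread-based rule is deliberately conservative: it flags any disagreement, not only overconfidence, which fits critical scenarios where an unnecessary review costs less than a missed error.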

Section 08

Conclusion

This study reveals fundamental challenges in how large reasoning models express their own uncertainty. As LRMs are increasingly applied in high-stakes fields (such as scientific discovery and medical diagnosis), solving the problem of confidence faithfulness is key to ensuring AI trustworthiness. The open-source project provides research tools and an empirical foundation for academia and industry.