# Yale NLP Proposes a New Framework for Quantifying the Faithfulness of Confidence Expressions in Reasoning Models

> Yale University's NLP Lab has open-sourced the faithful_lrm project, a systematic framework for evaluating whether the confidence that Large Reasoning Models (LRMs) express in their chain-of-thought faithfully reflects their internal uncertainty. The accompanying study highlights key challenges in confidence calibration for current reasoning models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-03T17:26:21.000Z
- Last activity: 2026-06-03T17:50:17.309Z
- Popularity: 159.6
- Keywords: large reasoning models, confidence calibration, chain-of-thought, uncertainty quantification, AI interpretability, model evaluation, Yale University, open-source tools
- Page link: https://www.zingnex.cn/en/forum/thread/nlp-e0ec0eec
- Canonical: https://www.zingnex.cn/forum/thread/nlp-e0ec0eec

---

## Yale NLP Open-Sources faithful_lrm Framework, Focusing on Evaluating the Faithfulness of Confidence Expressions in Large Reasoning Models

The framework is intended to improve the reliability and safety of AI systems by making it possible to check whether the confidence a Large Reasoning Model (LRM) expresses in its chain-of-thought matches its internal uncertainty.

## Research Background and Motivation

Large reasoning models (e.g., DeepSeek-R1, QwQ) often state their confidence in words (such as "I am very confident") while working through complex tasks via chain-of-thought. A core question has been overlooked: do these expressions faithfully reflect the model's internal uncertainty? The answer matters for AI reliability: overconfident wording can lead users to trust wrong answers, while excessive hedging reduces practical value.

## Core Methodology

The framework estimates a model's internal confidence along three dimensions:
1. **Representation-based Confidence**: Analyze the activation patterns of the model's hidden layers and extract internal uncertainty using the DeepConf metric;
2. **Token Probability-based Confidence**: Use token log probabilities and aggregate chain-of-thought probability information via the RCC metric;
3. **Sampling Consistency-based Confidence**: Sample continuations multiple times and measure confidence by how consistent the outputs are.

In addition, the framework uses Gemini-2.5-Flash to score the linguistic decisiveness of each reasoning trajectory, then calculates the "faithfulness gap" between that score and the internal confidence.
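
The sampling-consistency and token-probability ideas, and the gap between wording and internal signal, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the geometric-mean token probability is a generic stand-in rather than the RCC or DeepConf metrics, and the gap is taken as a simple signed difference with both scores scaled to [0, 1]. The project may define these quantities differently.

```python
import math
from collections import Counter


def sampling_consistency(answers):
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def mean_token_probability(token_logprobs):
    """Geometric-mean token probability: exp of the average log probability.

    A generic aggregate, used here only as a stand-in for the RCC metric.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def faithfulness_gap(decisiveness, internal_confidence):
    """Signed gap (an assumed definition).

    Positive values mean the wording is more confident than the internal signal.
    """
    return decisiveness - internal_confidence


# Hypothetical trajectory: five sampled final answers and a judge score of 0.9.
answers = ["42", "42", "17", "42", "42"]
internal = sampling_consistency(answers)  # 4 of 5 samples agree
gap = faithfulness_gap(0.9, internal)
print(f"internal={internal:.2f} gap={gap:+.2f}")
```

A positive gap flags a trajectory whose language sounds more certain than the sampled answers justify, while a negative gap flags hedged wording over a stable answer.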

## Experimental Design and Datasets

The experiments cover multiple reasoning-intensive benchmarks: AIME (mathematical reasoning), HLE (comprehensive reasoning), SuperGPQA (scientific QA), LegalBench (legal reasoning), and MuSR (multi-step reasoning). The tested models include the DeepSeek-R1-Distill series and Qwen/QwQ series, with parameter sizes ranging from 7B to 32B.

## Key Findings

The study reports four key findings:
1. **Reasoning ability ≠ Confidence calibration**: There is no necessary connection between a model's reasoning performance and the faithfulness of its confidence expressions; training objectives focus on correctness rather than calibration;
2. **Limited effect of prompt interventions**: Strategies like perceptual language and metacognitive hedging prompts cannot reliably fix calibration issues;
3. **Significant divergence among confidence estimators**: The three internal estimators (representation, probability, sampling) show large differences in evaluation results for the same trajectory;
4. **High-confidence errors are common**: Models often exhibit high linguistic confidence even when giving wrong answers, posing a misleading risk.
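
Findings 3 and 4 correspond to quantities that are easy to compute once each trajectory has a decisiveness score, a correctness label, and one value per estimator. The sketch below is illustrative only: the record layout, the 0.8 threshold, and the use of mean absolute difference as a disagreement measure are all assumptions, not the study's definitions, and the sample records are made up.

```python
from itertools import combinations


def high_confidence_error_rate(records, threshold=0.8):
    """Share of decisively worded trajectories (score >= threshold) that are wrong."""
    confident = [r for r in records if r["decisiveness"] >= threshold]
    if not confident:
        return 0.0
    wrong = sum(1 for r in confident if not r["correct"])
    return wrong / len(confident)


def estimator_disagreement(records, names):
    """Mean absolute difference between each pair of estimators across records."""
    if not records:
        raise ValueError("need at least one record")
    result = {}
    for a, b in combinations(names, 2):
        diffs = [abs(r[a] - r[b]) for r in records]
        result[(a, b)] = sum(diffs) / len(diffs)
    return result


# Made-up records; field names are hypothetical.
records = [
    {"decisiveness": 0.90, "correct": False,
     "representation": 0.4, "probability": 0.7, "sampling": 0.50},
    {"decisiveness": 0.95, "correct": True,
     "representation": 0.8, "probability": 0.9, "sampling": 1.00},
    {"decisiveness": 0.30, "correct": False,
     "representation": 0.2, "probability": 0.6, "sampling": 0.25},
]

print(high_confidence_error_rate(records))
print(estimator_disagreement(records, ["representation", "probability", "sampling"]))
```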

## Technical Implementation and Open-Source Contributions

The project open-sources a complete experimental framework:
- **Experiment Generation Module**: GPU inference pipeline (vLLM/HuggingFace), decisiveness scoring scripts, implementations of the three confidence estimators, dataset loaders;
- **Analysis Module**: Visualization scripts (scatter plots, heatmaps, etc.), clustering/binning analysis, interactive HTML dashboard generation.
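
As a rough idea of what a binning analysis can look like, the sketch below groups trajectories into equal-width decisiveness bins and reports accuracy per bin. It is not taken from the repository: the function name, the input shape (pairs of score and correctness), and the choice of equal-width bins are assumptions.

```python
def accuracy_by_decisiveness_bin(records, n_bins=5):
    """Group (score, correct) pairs into equal-width bins over [0, 1].

    Returns one row per non-empty bin with its range, count, and accuracy.
    """
    bins = [[] for _ in range(n_bins)]
    for score, correct in records:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # A score of exactly 1.0 belongs to the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append(bool(correct))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "count": len(members),
            "accuracy": sum(members) / len(members),
        })
    return rows


# Made-up data: two hedged trajectories and two decisive ones.
rows = accuracy_by_decisiveness_bin(
    [(0.10, False), (0.15, True), (0.95, True), (1.0, False)]
)
for row in rows:
    print(row)
```

If decisive wording were well calibrated, accuracy would rise from the low bins to the high bins; a flat or falling profile points to the kind of mismatch the study describes.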

## Practical Implications and Recommendations

Recommendations for developers:
1. **Multi-dimensional monitoring**: Combine representation-based, probability-based, and other signals instead of relying on linguistic confidence alone;
2. **Calibration training**: Add explicit calibration objectives during training instead of optimizing accuracy alone;
3. **Human-machine collaboration**: In critical scenarios, trigger manual review when the confidence signals disagree.

For researchers: The framework provides a benchmark tool for evaluating the reliability of reasoning models and supports the development of more faithful and transparent AI systems.
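
The third developer recommendation can be sketched as a simple gate that flags a response when its confidence signals are too far apart. This is a hypothetical helper, not part of the project: the signal names, the [0, 1] scale, and the 0.3 spread threshold are assumptions that would need tuning for a real deployment.

```python
def needs_manual_review(signals, max_spread=0.3):
    """Return True when confidence signals disagree by more than max_spread.

    `signals` maps a signal name to a confidence value in [0, 1].
    """
    if len(signals) < 2:
        raise ValueError("need at least two signals to compare")
    values = list(signals.values())
    return max(values) - min(values) > max_spread


# Decisive wording with weaker internal signals: send to a human.
print(needs_manual_review({"linguistic": 0.9, "probability": 0.5, "sampling": 0.6}))
```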

## Conclusion

This study reveals fundamental challenges in how large reasoning models express their own confidence. As LRMs are increasingly applied in high-stakes fields (such as scientific discovery and medical diagnosis), closing the gap between stated and internal confidence is key to ensuring AI trustworthiness. The open-source project provides research tools and an empirical foundation for academia and industry.
