Zing Forum

CDUR: How Chain-of-Thought Budget in Large Language Models Triggers Overconfidence—A Deep Analysis of Calibration Drift Phenomenon

This article analyzes the CDUR (Calibration Drift Under Reasoning) phenomenon, in which the calibration error of a large language model changes non-monotonically as its reasoning budget increases, and introduces CABStop, a calibration-aware stopping rule.

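Tags: Large Language Models, Chain-of-Thought, Calibration Drift, CDUR, Overconfidence, ECE, CABStop, Reasoning Budget, Llama, Machine Learning
Published 2026-06-11 15:15 | Recent activity 2026-06-11 15:19 | Estimated read 7 min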

Section 01

Deep Analysis of the CDUR Phenomenon: The Non-Monotonic Relationship Between Reasoning Budget and Overconfidence in Large Language Models

This article examines the CDUR (Calibration Drift Under Reasoning) phenomenon: the Expected Calibration Error (ECE) of a large language model changes non-monotonically as the reasoning budget increases, a pattern the study describes as a U-shaped curve. Key findings include: 1) calibration does not improve steadily with reasoning budget, and the relationship between the two is non-monotonic; 2) a hypothesis locking model explains how extra reasoning can produce overconfidence; 3) CABStop, a calibration-aware stopping rule, adjusts the reasoning budget dynamically. The experiments use Llama-series models, and the findings bear on how LLMs are evaluated and deployed.

Section 02

Research Background and Definition of CDUR Phenomenon

The conventional view is that a larger reasoning budget improves both accuracy and calibration. The CDUR research team instead observed calibration drift: as the reasoning budget increases, the Expected Calibration Error (ECE) changes non-monotonically. CDUR is defined as follows: as the reasoning budget B increases, ECE(B) follows a U-shaped trajectory with an optimal budget point, beyond which calibration degrades. The experiments were run on Llama-3.1-8B and Llama-3.3-70B, covering 4 budget levels and 21 types of reasoning trap problems.
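
Since the definition rests on ECE, it may help to see how that metric is commonly computed. The sketch below uses the standard equal-width binning formulation; the function names, the choice of 10 bins, and the reading of "overconfidence gap" as mean confidence minus accuracy are assumptions for illustration, not details taken from the study.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group (confidence, correctness) pairs by confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def overconfidence_gap(confidences: Sequence[float],
                       correct: Sequence[bool]) -> float:
    """Assumed definition: mean confidence minus accuracy (positive = overconfident)."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(confidences) / n - sum(1 for ok in correct if ok) / n
```

Evaluating `expected_calibration_error` once per budget level on the same question set would give the ECE(B) values whose shape the CDUR definition describes.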

Section 03

CDUR Mechanism: Explanation via Hypothesis Locking Model

To explain CDUR, the study proposes a hypothesis locking model. During autoregressive reasoning, the model starts by weighing several candidate solution paths, then commits to one hypothesis as the number of steps grows. If that hypothesis is wrong, later steps reinforce it rather than test it, and confidence rises without a matching rise in accuracy. The effect is strongest at the "light" budget level, where the model has had enough steps to form a strong belief but not enough to reach the self-correction seen at the "heavy" level. This is why ECE peaks at light and falls at heavy.
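
The reinforcement step can be made concrete with a toy calculation. This is purely illustrative and not the study's model: the function `locked_confidence`, the log-odds update, and the `gain` parameter are all assumptions chosen to show how confidence in an already-chosen hypothesis can grow with step count while its correctness stays fixed.

```python
import math


def locked_confidence(initial_conf: float, gain: float, steps: int) -> float:
    """Toy model: confidence in the locked hypothesis after `steps` reasoning steps.

    Each step adds `gain` to the log-odds of the hypothesis the model already
    favors, whether or not that hypothesis is correct (hypothetical update rule).
    """
    if not 0.0 < initial_conf < 1.0:
        raise ValueError("initial_conf must be strictly between 0 and 1")
    log_odds = math.log(initial_conf / (1.0 - initial_conf))
    log_odds += gain * steps
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Confidence climbs with every step; whether the answer is right was
    # already decided when the hypothesis was locked.
    for steps in (0, 2, 4, 8):
        print(steps, round(locked_confidence(0.6, 0.5, steps), 3))
```

In this toy setting a wrong hypothesis locked at 60% confidence ends up held with near certainty after a few steps, which is the kind of confidence-accuracy gap that ECE penalizes. The self-correction attributed to the heavy budget level is not modeled here.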

Section 04

Experimental Design and Dataset Construction

The study built a dataset of 25 reasoning trap problems, that is, problems whose intuitive answer is wrong, spanning more than 15 categories such as counting, set theory, and spatial reasoning. Each configuration was run with three seeds (1, 2, and 3) so that results could be reported as a mean with a standard deviation. Problems are managed through a TrapQuestion data class that holds an ID, category, text, and answer. Evaluation metrics include ECE, overconfidence gap, and accuracy.
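
A data class with the four fields listed above might look like the following sketch. Only the class name TrapQuestion and the field list come from the article; the exact field names, the `is_correct` helper, and the sample question are assumptions added for illustration (the sample is a well-known puzzle, not an item from the study's dataset).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrapQuestion:
    """One reasoning trap problem (field names are assumed, not from the study)."""
    question_id: str
    category: str
    text: str
    answer: str

    def is_correct(self, prediction: str) -> bool:
        # Hypothetical helper: case- and whitespace-insensitive exact match.
        return prediction.strip().lower() == self.answer.strip().lower()


# Illustrative item only; not taken from the study's dataset.
example = TrapQuestion(
    question_id="demo-001",
    category="arithmetic",
    text=("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
          "than the ball. How many cents does the ball cost?"),
    answer="5",
)

SEEDS = (1, 2, 3)  # seed values stated in the article
```

Making the class frozen keeps each problem immutable across seed runs, so every run is scored against the same reference answer.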

Section 05

Analysis of Core Experimental Results

The experimental results of Llama-3.1-8B show the CDUR phenomenon:

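| Budget level | ECE (mean ± std dev) | Overconfidence gap | Accuracy |
|--------------|----------------------|--------------------|----------|
| none         | 0.0436 ± 0.015       | +0.4930            | 0.4610   |
| light        | 0.1040 ± 0.034       | +0.2490            | 0.7320   |
| medium       | 0.0496 ± 0.049       | +0.3360            | 0.6530   |
| heavy        | 0.0145 ± 0.005       | +0.2450            | 0.7390   |

ECE rises from none to light, falls at medium, and is lowest at heavy, so on this model the drift is non-monotonic with its peak at the light level. Accuracy improves sharply from none to light (0.4610 to 0.7320), yet light is also where calibration is worst, which indicates a trade-off between accuracy and calibration at small budgets.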
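
The pattern can be checked directly from the reported means. The snippet below is a small sketch that hard-codes the four rows of the Llama-3.1-8B table; the names `RESULTS` and `is_monotonic` are introduced here for illustration.

```python
# (budget level, mean ECE, overconfidence gap, accuracy) for Llama-3.1-8B,
# copied from the results table in this section.
RESULTS = [
    ("none",   0.0436, 0.4930, 0.4610),
    ("light",  0.1040, 0.2490, 0.7320),
    ("medium", 0.0496, 0.3360, 0.6530),
    ("heavy",  0.0145, 0.2450, 0.7390),
]


def is_monotonic(values):
    """True if the sequence never changes direction (all rising or all falling)."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


ece_by_level = {level: ece for level, ece, _, _ in RESULTS}
worst = max(ece_by_level, key=ece_by_level.get)  # level with the highest ECE
best = min(ece_by_level, key=ece_by_level.get)   # level with the lowest ECE

print("monotonic:", is_monotonic(list(ece_by_level.values())))
print("worst calibrated:", worst)
print("best calibrated:", best)
```

Because the check uses only the means, it ignores the reported standard deviations. The medium level in particular has a spread (± 0.049) about as large as its mean.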

Section 06

CABStop: Calibration-Aware Dynamic Stopping Rule

Building on this account of CDUR, the study proposes the CABStop algorithm. Rather than fixing the budget in advance, CABStop allocates it dynamically according to problem difficulty and how the reasoning is going. At each checkpoint it compares the model's stated confidence with an auxiliary accuracy estimate obtained through self-consistency sampling, and it stops reasoning once the gap between the two exceeds a threshold delta. The aim is to balance accuracy against calibration.
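
The checkpoint loop described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `get_confidence` and `sample_answers`, the use of majority-vote agreement as the auxiliary accuracy estimate, the absolute-value gap, the default `delta`, and returning the answer at the stopping checkpoint are all choices made here.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def self_consistency_estimate(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def cabstop(checkpoints: Sequence[int],
            get_confidence: Callable[[int], float],
            sample_answers: Callable[[int], Sequence[str]],
            delta: float = 0.2) -> Tuple[int, str, bool]:
    """Walk through budget checkpoints; stop when |confidence - agreement| > delta.

    Returns (budget at which reasoning ended, majority answer there,
    whether the stop was triggered early). All parameters are hypothetical.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    for budget in checkpoints:
        confidence = get_confidence(budget)
        answer, agreement = self_consistency_estimate(sample_answers(budget))
        # Agreement among samples stands in for the auxiliary accuracy estimate.
        if abs(confidence - agreement) > delta:
            return budget, answer, True
    # Budget exhausted without the gap ever exceeding delta.
    return budget, answer, False
```

In a real system `get_confidence` and `sample_answers` would wrap model calls at the given budget. Here they are left as plain callables so the stopping logic can be tested with stub data.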

Section 07

Research Significance and Future Directions

The CDUR findings matter in three areas: LLM evaluation, which should report calibration alongside accuracy; deployment, where the reasoning budget must be traded off against calibration; and model design, where reducing hypothesis locking is a possible goal. Future directions include measuring how sensitive different model architectures are to CDUR, developing finer-grained calibration-aware strategies, and extending CABStop to multimodal and interactive scenarios.