Section 01
An In-Depth Analysis of the CDUR Phenomenon: The Nonlinear Relationship Between Reasoning Budget and Overconfidence in Large Language Models
This article examines the CDUR (Calibration Drift Under Reasoning) phenomenon: as the reasoning budget of a large language model increases, its expected calibration error (ECE) follows a U-shaped curve, first improving and then deteriorating. Key findings: 1) calibration performance depends non-monotonically on the reasoning budget; 2) a hypothesis-locking model explains the mechanism behind overconfidence; 3) a calibration-aware stopping rule, CABStop, is proposed to adjust the reasoning budget dynamically. The study is based on experiments with Llama-series models and has practical implications for LLM evaluation and deployment.
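For readers unfamiliar with the metric, the sketch below shows one common way to compute ECE: predictions are grouped into equal-width confidence bins, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of samples. The function name `expected_calibration_error`, the `n_bins` parameter, and the equal-width binning are assumptions made for illustration; the article does not state which binning scheme the study used, and the inputs in the usage note are toy values, not results from the study.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE (illustrative; binning scheme is an assumption)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Per-bin running sums of confidence, correctness, and sample count.
    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, y in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += 1.0 if y else 0.0
        count[idx] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[k] / count[k]
        avg_acc = acc_sum[k] / count[k]
        # Weight the |accuracy - confidence| gap by the bin's share of samples.
        ece += (count[k] / n) * abs(avg_acc - avg_conf)
    return ece
```

Under this definition, a model that answers with confidence 1.0 but is right only half the time has an ECE of 0.5, while a model that reports confidence 0.5 and is right half the time has an ECE of 0. Evaluating the same function at several reasoning budgets is one way to trace the U-shaped curve the article describes.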