Section 01
Hidden Error Awareness in Chain-of-Thought Reasoning: Models Internally Recognize Mistakes but Externally Remain Confident
This study reveals hidden error awareness in large language models during chain-of-thought reasoning: models internally detect their own reasoning errors (predicting correctness from hidden states reaches 0.95 AUROC), yet the confidence they express externally is almost indistinguishable from the confidence they express when reasoning correctly. The signal is diagnostic rather than causal: it can indicate whether the reasoning is correct, but errors cannot be corrected through existing interventions. This challenges the assumption that chain-of-thought reasoning faithfully reflects the model's internal computations.
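To make the AUROC comparison concrete, here is a minimal sketch of how an internal error signal and an externally expressed confidence could be scored against the same labels. Everything in it is an illustrative assumption: the function name `auroc`, the variable names, and all numbers are made up for the example and are not data or code from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (error, non-error) pairs where the error example gets the higher
    score. Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = the reasoning contains an error, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0]

# Hypothetical error scores read out from hidden states (higher = more
# likely to be an error). These mostly separate the two classes.
hidden_state_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

# Hypothetical expressed uncertainty (1 - stated confidence). The values
# look the same whether or not the reasoning is wrong.
expressed_uncertainty = [0.10, 0.05, 0.10, 0.10, 0.05, 0.10]

print(round(auroc(labels, hidden_state_scores), 3))    # 0.889
print(round(auroc(labels, expressed_uncertainty), 3))  # 0.5
```

An AUROC of 0.5 means the score ranks errors no better than chance, which is the sense in which expressed confidence is "indistinguishable" between correct and incorrect reasoning. A value near 1.0 means errors are almost always ranked above non-errors.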