Zing Forum


When Large Models Start to 'Doubt Themselves': How Prompt Framing Affects Mathematical Reasoning Ability

An experiment on Qwen2.5-Math found that when known solvable math problems are described as 'unsolved' or 'open questions', the model's exact-match accuracy drops from 60% to 45%. Follow-up controlled experiments, however, point to a more nuanced picture: this 'self-doubt' effect is largely an interaction between prompt framing and answer format rather than a real degradation of the model's reasoning ability.

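Large language models, mathematical reasoning, prompt engineering, self-doubt, Qwen, model evaluation, AI, confidence calibration
Published 2026-06-12 00:51 · Recent activity 2026-06-12 01:18 · Estimated read 7 min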

Section 01

Introduction: The Truth Behind Large Models' 'Self-Doubt' Phenomenon: Interaction Between Prompt Framing and Answer Format

When Qwen2.5-Math is told that known solvable math problems are 'unsolved' or 'open questions', its accuracy drops from 60% to 45%. Further controlled experiments show that this drop is mostly an interaction between prompt framing and answer format, not a real degradation of the model's reasoning ability. The study examines how a model's apparent confidence affects its mathematical reasoning performance and what that implies for evaluation and system design.


Section 02

Research Background and Motivation

The performance of large language models on mathematical reasoning tasks is a core focus of AI research, but whether a model's 'confidence' affects that performance has been explored far less. This experiment, initiated by rishabhsai, uses the Qwen2.5-Math-1.5B-Instruct model and systematically varies how problems are framed in order to observe changes in the model's reasoning behavior.


Section 03

Experimental Design and Methodology

Core Experimental Framework

The study uses a 'paired framing' design in which the same set of known solvable problems is presented in two contexts:

  • Neutral framing: Present the problem directly without difficulty hints
  • Open/unsolved framing: Add guiding phrases like 'open question' or 'no known solution yet'
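The paired design can be sketched as follows. This is a minimal illustration, not the study's code: the prefix wording, the `PromptPair` type, and the sample problems are hypothetical, since the write-up only says that phrases like 'open question' or 'no known solution yet' were added.

```python
from dataclasses import dataclass

# Hypothetical wording: the source only says the open framing adds phrases
# like 'open question' or 'no known solution yet'.
OPEN_PREFIX = "The following is an open question with no known solution yet."


@dataclass(frozen=True)
class PromptPair:
    problem_id: int
    neutral: str       # problem text presented as-is
    open_framed: str   # same problem text with the open/unsolved prefix


def build_pairs(problems: list[str]) -> list[PromptPair]:
    """Return one neutral and one open-framed prompt per problem."""
    pairs = []
    for i, text in enumerate(problems):
        body = text.strip()
        pairs.append(
            PromptPair(
                problem_id=i,
                neutral=body,
                open_framed=f"{OPEN_PREFIX}\n\n{body}",
            )
        )
    return pairs


# Illustrative problems, not taken from the study's dataset.
pairs = build_pairs(["Compute 12 * 7.", "Solve x + 3 = 10 for x."])
for pair in pairs:
    print(pair.problem_id, repr(pair.open_framed))
```

Because both prompts in a pair share the same problem text, any difference in accuracy between the two conditions can be attributed to the framing rather than to the problems.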

Evaluation Metrics

Exact match (the final answer must be identical to the reference answer) is the main criterion, which avoids ambiguity in scoring.
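A minimal sketch of exact-match scoring is shown below. It assumes the final answer is written inside `\boxed{...}`, a common convention for math models; the write-up does not say how the final answer was extracted, and the normalization rules here are likewise assumptions.

```python
import re


def normalize(answer: str) -> str:
    """Light normalization (an assumption): drop '$', spaces, and a trailing period."""
    cleaned = answer.strip().replace("$", "").replace(" ", "")
    return cleaned.rstrip(".")


def extract_final_answer(output: str):
    """Return the contents of the last \\boxed{...}, or None if there is none.

    Only non-nested braces are handled, which is enough for simple answers.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1] if matches else None


def exact_match(output: str, reference: str) -> bool:
    """True only if an extracted final answer equals the reference after normalization."""
    final = extract_final_answer(output)
    if final is None:
        return False
    return normalize(final) == normalize(reference)


def accuracy(outputs, references) -> float:
    """Fraction of outputs whose final answer exactly matches its reference."""
    hits = sum(exact_match(o, r) for o, r in zip(outputs, references))
    return hits / len(references)
```

Under this kind of scorer, an output that reaches the right value but never commits to a clearly marked final answer is counted as wrong, which matters for how the later results are interpreted.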

Controlled Variables

The random seed, maximum generation length (384 tokens), and sampling temperature are held fixed, and all raw generations are saved.
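One way to pin these settings down and keep the raw outputs is sketched here. Only the 384-token limit and the model name come from the write-up; the seed and temperature values are placeholders, and `RunConfig` and `save_generations` are hypothetical names.

```python
import json
import random
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunConfig:
    model_name: str = "Qwen2.5-Math-1.5B-Instruct"
    max_new_tokens: int = 384   # stated in the write-up
    seed: int = 0               # placeholder: the actual seed is not reported
    temperature: float = 0.0    # placeholder: the actual value is not reported


def save_generations(path: Path, config: RunConfig, records: list[dict]) -> None:
    """Write the config on the first line, then one raw generation per line (JSONL)."""
    with path.open("w", encoding="utf-8") as f:
        f.write(json.dumps({"config": asdict(config)}) + "\n")
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


config = RunConfig()
random.seed(config.seed)  # a real run would also seed the ML framework's RNG
```

Storing the configuration next to the untouched generations makes it possible to re-score the same outputs later with a different metric, without re-running the model.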


Section 04

Preliminary Findings and In-depth Exploration

Preliminary Results

Framing Type            | Exact Match Accuracy
Neutral Framing         | 60%
Open/Unsolved Framing   | 45%
Difference              | -15 percentage points

This result is referred to as 'observable self-doubt'.

Follow-up Controlled Experiments

After an 'answer-first' format is introduced (the answer is stated first, followed by the reasoning):

Framing Type            | Answer-first Format Accuracy
Neutral Framing         | 55%
Open/Unsolved Framing   | 55%
Difference              | 0 percentage points
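An answer-first condition could be implemented roughly as below. The instruction wording and the `Answer:` line convention are hypothetical; the write-up only states that the answer comes first and the reasoning follows.

```python
# Hypothetical instruction wording, appended to either framing of a problem.
ANSWER_FIRST_INSTRUCTION = (
    "State the final answer on the first line as 'Answer: <value>', "
    "then explain your reasoning."
)


def answer_first_prompt(prompt: str) -> str:
    """Append the answer-first instruction to a neutral or open-framed prompt."""
    return f"{prompt}\n\n{ANSWER_FIRST_INSTRUCTION}"


def parse_answer_first(output: str):
    """Return the value after 'Answer:' on the first non-empty line, else None."""
    for line in output.splitlines():
        head = line.strip()
        if not head:
            continue  # skip leading blank lines
        if head.lower().startswith("answer:"):
            return head[len("answer:"):].strip()
        return None  # first non-empty line is not an answer line
    return None
```

The point of the format is that the scored answer sits in a fixed position, so long or tentative reasoning afterwards cannot change what the scorer sees.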

Key Insights

The initial accuracy drop is an interaction between prompt framing and answer format. In free-form output, neutral framing produces more structured answers, while open framing induces long, tentative answers that lower the exact-match rate. Forcing the answer-first format brings both framings to the same accuracy (55%).
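The interaction can be made explicit as a difference of differences over the four reported accuracies. The numbers below are the ones from the two tables; the labels `free` and `answer_first` are shorthand introduced for this sketch.

```python
# Reported accuracies in percent, keyed by (output format, framing).
reported = {
    ("free", "neutral"): 60,
    ("free", "open"): 45,
    ("answer_first", "neutral"): 55,
    ("answer_first", "open"): 55,
}


def framing_gap(fmt: str) -> int:
    """Neutral minus open accuracy, in percentage points, for one output format."""
    return reported[(fmt, "neutral")] - reported[(fmt, "open")]


gap_free = framing_gap("free")
gap_answer_first = framing_gap("answer_first")
interaction = gap_free - gap_answer_first  # difference of differences

print(gap_free, gap_answer_first, interaction)  # 15 0 15
```

The framing gap exists only in the free-form condition. Note also that answer-first moves the two framings in opposite directions: neutral falls from 60% to 55% while open rises from 45% to 55%.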


Section 05

Scenarios That Truly Trigger 'Self-Doubt'

  1. Truly open or underspecified problems: When information is insufficient or the problem is genuinely unsolved, the model's output is full of phrases such as 'cannot be solved' or 'insufficient information'.
  2. Solvable problems: Even under open framing, signs of self-doubt are limited; the change is mostly in answer format rather than a decline in reasoning quality.
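A rough way to quantify this contrast is to count how often outputs contain such phrases. The sketch below uses only the two phrases quoted above as markers; a real analysis would likely need a broader, validated list, and the function names are hypothetical.

```python
# The two phrases quoted in the text; any longer list would be an assumption.
HEDGE_MARKERS = ("cannot be solved", "insufficient information")


def hedge_count(output: str) -> int:
    """Number of marker occurrences in one output (case-insensitive)."""
    text = output.lower()
    return sum(text.count(marker) for marker in HEDGE_MARKERS)


def hedging_rate(outputs: list[str]) -> float:
    """Fraction of outputs that contain at least one marker."""
    if not outputs:
        return 0.0
    return sum(hedge_count(o) > 0 for o in outputs) / len(outputs)
```

Comparing `hedging_rate` between genuinely underspecified problems and solvable problems under open framing would give a simple numeric check on the distinction drawn in the list above.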

Section 06

Implications for AI System Design

  1. Importance of prompt engineering: Prompt design has a significant impact on measured accuracy (a 15-point swing in this experiment), so the effects of different framings should be tested systematically.
  2. Limitations of evaluation metrics: Exact match can misrepresent actual quality differences; closer analysis of the reasoning process is required.
  3. Controllability of model confidence: A model's expressed confidence can be adjusted via prompts. The opportunity is to tune caution to the scenario; the risk is that malicious prompts induce hesitation or overconfidence.

Section 07

Limitations and Future Directions

Limitations

  • Limited sample size (20-50 questions)
  • Single model (only Qwen2.5-Math-1.5B-Instruct)
  • Simplified evaluation (exact match cannot capture partial correctness or reasoning quality)
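To illustrate why the sample size matters, the sketch below computes 95% Wilson score intervals for the reported 60% and 45% accuracies. The exact number of questions is not given beyond the 20-50 range, so the sizes 20 and 40 are hypothetical choices that make the counts whole numbers.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical sample sizes inside the stated 20-50 range.
for n in (20, 40):
    for pct in (60, 45):
        k = n * pct // 100  # whole-number count for these sizes
        lo, hi = wilson_interval(k, n)
        print(f"n={n}, {k}/{n} correct: [{lo:.2f}, {hi:.2f}]")
```

At both sizes the two intervals overlap substantially, so a 15-point gap on this few questions is weak evidence on its own. Since the design is paired, a paired test such as McNemar's would be the more appropriate tool if per-question outcomes were available.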

Future Directions

Expand to more model architectures and larger datasets, and adopt more refined evaluation metrics (step-by-step reasoning accuracy, confidence calibration, etc.).


Section 08

Conclusion

This study shows how prompt framing, answer format, and evaluation method jointly shape measured model performance. It is a reminder to AI researchers and developers to interpret results carefully, to distinguish genuine capability deficits from limitations of the measurement method, and to build more reliable and trustworthy intelligent systems.