# When Large Models Start to 'Doubt Themselves': How Prompt Framing Affects Mathematical Reasoning Ability

> An experiment with Qwen2.5-Math found that when known solvable math problems are described as 'unsolved' or 'open questions', the model's exact-match accuracy drops from 60% to 45%. Follow-up controlled experiments, however, show that this 'self-doubt' is better explained by an interaction between prompt framing and answer presentation style than by a real degradation of the model's reasoning ability.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-11T16:51:19.000Z
- Last activity: 2026-06-11T17:18:03.551Z
- Popularity: 157.6
- Keywords: large language models, mathematical reasoning, prompt engineering, self-doubt, Qwen, model evaluation, AI confidence calibration
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-rishabhsai-math-self-doubt
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-rishabhsai-math-self-doubt

---

## Introduction: The Truth Behind Large Models' 'Self-Doubt' Phenomenon

This post asks whether a model's 'confidence' affects its mathematical reasoning performance. In the initial experiment, describing solvable problems as 'unsolved' or 'open questions' lowers Qwen2.5-Math's exact-match accuracy from 60% to 45%. Once the answer format is controlled, the gap disappears, which points to an interaction between prompt framing and answer presentation style rather than a real loss of reasoning ability.

## Research Background and Motivation

The performance of large language models on mathematical reasoning tasks is a core focus of AI research, but whether a model's 'confidence' affects that performance has received less attention. This experiment, initiated by rishabhsai, uses the Qwen2.5-Math-1.5B-Instruct model and systematically varies how each problem is framed to observe changes in the model's reasoning behavior.

## Experimental Design and Methodology

### Core Experimental Framework
The experiment uses a 'paired framing' design: the same set of known solvable problems is presented in two contexts:
- Neutral framing: Present the problem directly without difficulty hints
- Open/unsolved framing: Add guiding phrases like 'open question' or 'no known solution yet'
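
The paired design can be sketched as follows. This is a minimal illustration, not the study's code: the prefix wording in `OPEN_PREFIX`, the `PromptPair` type, and the `make_pair` helper are all hypothetical names introduced here, since the exact prompts are not given in this write-up.

```python
from dataclasses import dataclass

# Hypothetical wording: the study's exact framing phrase is not reported here.
OPEN_PREFIX = "The following is an open question with no known solution yet.\n"


@dataclass(frozen=True)
class PromptPair:
    """Two prompts for one problem that differ only in framing."""
    problem_id: str
    neutral: str
    open_framed: str


def make_pair(problem_id: str, problem_text: str) -> PromptPair:
    """Build both framings from the same problem text."""
    neutral = problem_text.strip()
    # The open framing only prepends a phrase; the problem itself is unchanged.
    return PromptPair(problem_id, neutral, OPEN_PREFIX + neutral)


pair = make_pair("p1", "What is 17 * 6?")
print(pair.neutral)
print(pair.open_framed)
```

Deriving both prompts from one source string keeps the problem text identical across conditions, so any difference in output can be attributed to the framing phrase.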

### Evaluation Metrics
Exact match is the main criterion: a response counts as correct only if its final answer is identical to the reference answer, which avoids ambiguity in scoring.
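
A minimal sketch of exact-match scoring is shown below. The normalization step (trimming whitespace, a trailing period, and letter case) is an assumption made for illustration; the write-up does not say how the study normalized answers, if at all.

```python
def normalize(answer: str) -> str:
    """Assumed normalization: trim whitespace and a trailing period, lowercase."""
    return answer.strip().rstrip(".").strip().lower()


def exact_match(predicted: str, reference: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize(predicted) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


print(exact_match(" 102. ", "102"))       # formatting noise is tolerated
print(exact_match("about 102", "102"))    # extra words make the match fail
```

Note that the metric gives no partial credit: a correct value wrapped in extra words scores the same as a wrong value.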

### Controlled Variables
The random seed, maximum generation length (384 tokens), and model temperature are held fixed, and all raw generations are saved.
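
One way to pin these settings and keep the raw outputs is sketched below. Only the model name and the 384-token limit come from this write-up; the `seed` and `temperature` values are placeholders, and the `RunConfig` and `to_record` names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    model_name: str = "Qwen2.5-Math-1.5B-Instruct"  # from the write-up
    max_new_tokens: int = 384                        # from the write-up
    seed: int = 0              # placeholder: actual value not reported
    temperature: float = 0.0   # placeholder: actual value not reported


def to_record(config: RunConfig, problem_id: str, framing: str, raw_output: str) -> str:
    """Serialize one generation, with its settings, as a single JSON line."""
    return json.dumps(
        {
            "config": asdict(config),
            "problem_id": problem_id,
            "framing": framing,
            "raw_output": raw_output,
        },
        ensure_ascii=False,
    )


line = to_record(RunConfig(), "p1", "neutral", "17 * 6 = 102")
print(line)
```

Storing the configuration next to every raw generation makes it possible to re-score the same outputs later with a different metric.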

## Preliminary Findings and In-depth Exploration

### Preliminary Results

| Framing Type | Exact Match Accuracy |
|---|---|
| Neutral Framing | 60% |
| Open/Unsolved Framing | 45% |
| Difference | -15 percentage points |

This result is referred to as 'observable self-doubt'.

### Follow-up Controlled Experiments

When an 'answer-first' format is enforced (the model states the answer first, then the reasoning), the gap disappears:

| Framing Type | Answer-first Format Accuracy |
|---|---|
| Neutral Framing | 55% |
| Open/Unsolved Framing | 55% |
| Difference | 0 percentage points |

### Key Insights

The initial accuracy drop is an interaction effect between prompt framing and answer presentation style. With free-form output, neutral framing yields more structured responses, while open framing induces lengthy, tentative answers that lower the exact match rate. Forcing the answer-first format makes performance the same under both framings.
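
The mechanism can be illustrated with a toy example. The two extraction rules and the three sample outputs below are invented for illustration; the study's actual answer-extraction rule and model outputs are not given in this write-up.

```python
import re


def last_line_answer(output: str) -> str:
    """Free-form rule (assumed): treat the last non-empty line as the final answer."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def answer_first(output: str) -> str:
    """Answer-first rule (assumed): read the value from a leading 'Answer: ...' line."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return ""
    match = re.match(r"answer:\s*(.+)", lines[0], flags=re.IGNORECASE)
    return match.group(1).strip() if match else ""


# Invented outputs: all three contain the correct value, 102.
structured = "17 * 6 = 102\n102"
tentative = (
    "This may be an open problem, but 17 * 6 appears to be 102.\n"
    "I cannot be fully certain of this result."
)
answer_first_output = "Answer: 102\nThis is tentative, since the problem is described as open."

print(last_line_answer(structured) == "102")      # structured output matches
print(last_line_answer(tentative) == "102")       # hedged output fails exact match
print(answer_first(answer_first_output) == "102") # answer-first survives the hedging
```

In this toy case the reasoning is equally correct in every output; only the position and wording of the final answer decide whether exact match succeeds.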

## Scenarios That Truly Trigger 'Self-Doubt'

1. **Truly open or underdefined problems**: When information is insufficient or the problem is a genuinely unsolved puzzle, the model's output is full of phrases like 'cannot be solved' or 'insufficient information'.
2. **Solvable problems**: Even under open framing, signs of self-doubt are limited; the change shows up mainly in answer format rather than as a decline in reasoning quality.

## Implications for AI System Design

1. **Importance of prompt engineering**: Framing alone can shift measured accuracy, so the effects of different framings need to be tested systematically.
2. **Limitations of evaluation metrics**: Exact match can misrepresent actual quality differences; a finer-grained analysis of the reasoning process is required.
3. **Controllability of model confidence**: Confidence can be adjusted via prompts. The opportunity is to tune caution to the scenario; the risk is that malicious prompts induce hesitation or overconfidence.

## Limitations and Future Directions

### Limitations
- Limited sample size (20-50 questions)
- Single model (only Qwen2.5-Math-1.5B-Instruct)
- Simplified evaluation (exact match cannot capture partial correctness or reasoning quality)
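
The small sample matters when reading the 15-point gap. As a rough illustration, the sketch below computes 95% Wilson score intervals for the two free-form accuracies, assuming 20 questions per condition (the lower end of the stated range; the exact count is not reported), which would correspond to 12 and 9 correct answers.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts: 60% and 45% of a hypothetical 20 questions.
neutral_low, neutral_high = wilson_interval(12, 20)
open_low, open_high = wilson_interval(9, 20)

print(f"neutral: {neutral_low:.2f} to {neutral_high:.2f}")
print(f"open:    {open_low:.2f} to {open_high:.2f}")
```

Under this assumed sample size the two intervals overlap, so the initial gap on its own would be weak evidence of a framing effect even before the answer-format control is considered.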

### Future Directions

Future work could extend the experiment to more model architectures and larger datasets, and adopt finer-grained evaluation metrics such as step-by-step reasoning accuracy and confidence calibration.

## Conclusion

This study shows how prompt framing, answer format, and evaluation method jointly shape measured model performance. It is a reminder for AI researchers and developers to interpret results carefully, to distinguish real ability defects from limitations of the measurement method, and to build more reliable and trustworthy intelligent systems.
