# Prompt Sensitivity Study: How Misleading Prompts Cause a 60% Plunge in LLMs' Reasoning Ability

> An experimental study on open-source language models shows that even subtle prompt hints can significantly alter a model's reasoning behavior, with misleading prompts turning 60% of correct answers into errors.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-07T19:32:17.000Z
- 最近活动: 2026-06-07T19:52:15.901Z
- 热度: 150.7
- 关键词: 大语言模型, 提示工程, 推理能力, 提示敏感性, 对抗性提示, 认知偏差, Phi-3, 模型评估
- 页面链接: https://www.zingnex.cn/en/forum/thread/60-5f734545
- Canonical: https://www.zingnex.cn/forum/thread/60-5f734545
- Markdown 来源: floors_fallback

---

## [Introduction] Core Findings of Prompt Sensitivity Study: Misleading Prompts Cause 60% Plunge in LLMs' Reasoning Ability

This study was published by Hawa-Hardy on GitHub (original link: https://github.com/Hawa-Hardy/Do-hints-influence-reasoning-models-). It conducted experiments on open-source language models, with the core finding that misleading prompts can turn 60% of correct answers into errors. The study focuses on the robustness of LLMs' reasoning ability, exploring how subtle hints in prompts affect model behavior, and has important implications for prompt engineering, AI safety, and other fields.

## Research Background and Motivation

As large language models (LLMs) improve their performance on various reasoning tasks, a key question arises: Is the model's reasoning ability truly robust? Is it susceptible to subtle hints in prompts? Through systematic experiments, this study quantifies the impact of prompt sensitivity on the reasoning behavior of open-source models, with the core question being: To what extent can misleading prompts turn originally correct answers into errors?

## Experimental Design Methodology

### Test Question Selection
10 classic reasoning questions were selected, covering multiple cognitive domains such as language parsing traps, multi-step planning, Cognitive Reflection Test (CRT), and spatial reasoning.

### Three Prompt Conditions
| Condition | Description |
|------|------|
| Clean | Only provide the question, no hints |
| Helpful | Question + hints that help understand key concepts |
| Misleading | Question + hints that guide to wrong methods |

### Models and Environment
- Main test model: microsoft/Phi-3-mini-4k-instruct (runs without tokens, 4k context is sufficient)
- Alternative model: google/gemma-2-2b-it (requires Hugging Face authorization)
- Runtime environment: Google Colab T4 GPU

## Core Finding: 60% of Answers Go Wrong Due to Misleading Prompts

The study's most striking result: When misleading prompts are introduced, 60% (6/10) of correct answers become wrong. This finding has multiple implications:
1. **Reasoning Fragility**: The model's reasoning ability may be more fragile than it seems; unintended keywords or hints from users may cause the model to deviate from the correct path (similar to the human anchoring effect).
2. **Double-Edged Sword of Prompt Engineering**: Prompt engineering is both a tool to improve performance and can reduce it; well-intentioned prompts with improper wording may also have negative impacts.
3. **Safety and Alignment Considerations**: Prompt sensitivity may be maliciously exploited to induce wrong outputs via prompt injection, which is particularly dangerous in high-risk scenarios like healthcare and law.

## Links to Related Research

The methodology of this study draws on techniques from multiple fields:
- **Mechanical Interpretability**: Understanding the model's internal information processing mechanism
- **LLM Evaluation Methodology**: Benchmarks and protocols for standardized model capability testing
- **Adversarial Prompt Research**: Exploring ways to manipulate model behavior via input
- **Cognitive Bias Research**: Applying human psychology experimental designs to language models
The design of the three prompt conditions echoes classic experimental paradigms in cognitive science regarding biases and heuristics.

## Practical Implications and Recommendations

### Recommendations for Developers
1. Prompt Auditing: Regularly check system prompts in production environments to eliminate potential misleading language
2. Multi-Prompt Testing: Use multiple prompts with different wording for cross-validation in critical tasks
3. User Input Purification: Perform semantic analysis to detect interference when incorporating user input

### Implications for Researchers
1. Limitations of Benchmark Testing: Current standard benchmarks may overestimate the model's true reasoning ability (due to using clean prompts)
2. Robustness Evaluation: Need to develop evaluation protocols specifically for testing models' robustness to prompt changes
3. Causal Mechanism Exploration: Deeply study the causes and internal changes of models being misled by prompts

## Reproduction Path and Conclusion

### Reproduction Steps
1. Open `reasoning_experiment.ipynb` in Google Colab
2. Set the runtime to T4 GPU
3. Run all cells in order
4. Manually evaluate each response
5. Re-run the analysis cells to get statistical results

### Conclusion
Although this study is small in scale, it reveals the robustness issue of LLMs' reasoning ability. The 60% performance drop reminds us that we need to fully consider the risk of prompt sensitivity before deploying LLMs to critical applications. Only by understanding the model's capabilities and limitations can we use this technology responsibly.