Zing Forum

Reading

Prompt Sensitivity Study: How Misleading Prompts Cause a 60% Plunge in LLMs' Reasoning Ability

An experimental study on open-source language models shows that even subtle prompt hints can significantly alter a model's reasoning behavior, with misleading prompts turning 60% of correct answers into errors.

大语言模型提示工程推理能力提示敏感性对抗性提示认知偏差Phi-3模型评估
Published 2026-06-08 03:32Recent activity 2026-06-08 03:52Estimated read 8 min
Prompt Sensitivity Study: How Misleading Prompts Cause a 60% Plunge in LLMs' Reasoning Ability
1

Section 01

[Introduction] Core Findings of Prompt Sensitivity Study: Misleading Prompts Cause 60% Plunge in LLMs' Reasoning Ability

This study was published by Hawa-Hardy on GitHub (original link: https://github.com/Hawa-Hardy/Do-hints-influence-reasoning-models-). It conducted experiments on open-source language models, with the core finding that misleading prompts can turn 60% of correct answers into errors. The study focuses on the robustness of LLMs' reasoning ability, exploring how subtle hints in prompts affect model behavior, and has important implications for prompt engineering, AI safety, and other fields.

2

Section 02

Research Background and Motivation

As large language models (LLMs) improve their performance on various reasoning tasks, a key question arises: Is the model's reasoning ability truly robust? Is it susceptible to subtle hints in prompts? Through systematic experiments, this study quantifies the impact of prompt sensitivity on the reasoning behavior of open-source models, with the core question being: To what extent can misleading prompts turn originally correct answers into errors?

3

Section 03

Experimental Design Methodology

Test Question Selection

10 classic reasoning questions were selected, covering multiple cognitive domains such as language parsing traps, multi-step planning, Cognitive Reflection Test (CRT), and spatial reasoning.

Three Prompt Conditions

Condition Description
Clean Only provide the question, no hints
Helpful Question + hints that help understand key concepts
Misleading Question + hints that guide to wrong methods

Models and Environment

  • Main test model: microsoft/Phi-3-mini-4k-instruct (runs without tokens, 4k context is sufficient)
  • Alternative model: google/gemma-2-2b-it (requires Hugging Face authorization)
  • Runtime environment: Google Colab T4 GPU
4

Section 04

Core Finding: 60% of Answers Go Wrong Due to Misleading Prompts

The study's most striking result: When misleading prompts are introduced, 60% (6/10) of correct answers become wrong. This finding has multiple implications:

  1. Reasoning Fragility: The model's reasoning ability may be more fragile than it seems; unintended keywords or hints from users may cause the model to deviate from the correct path (similar to the human anchoring effect).
  2. Double-Edged Sword of Prompt Engineering: Prompt engineering is both a tool to improve performance and can reduce it; well-intentioned prompts with improper wording may also have negative impacts.
  3. Safety and Alignment Considerations: Prompt sensitivity may be maliciously exploited to induce wrong outputs via prompt injection, which is particularly dangerous in high-risk scenarios like healthcare and law.
5

Section 05

Links to Related Research

The methodology of this study draws on techniques from multiple fields:

  • Mechanical Interpretability: Understanding the model's internal information processing mechanism
  • LLM Evaluation Methodology: Benchmarks and protocols for standardized model capability testing
  • Adversarial Prompt Research: Exploring ways to manipulate model behavior via input
  • Cognitive Bias Research: Applying human psychology experimental designs to language models The design of the three prompt conditions echoes classic experimental paradigms in cognitive science regarding biases and heuristics.
6

Section 06

Practical Implications and Recommendations

Recommendations for Developers

  1. Prompt Auditing: Regularly check system prompts in production environments to eliminate potential misleading language
  2. Multi-Prompt Testing: Use multiple prompts with different wording for cross-validation in critical tasks
  3. User Input Purification: Perform semantic analysis to detect interference when incorporating user input

Implications for Researchers

  1. Limitations of Benchmark Testing: Current standard benchmarks may overestimate the model's true reasoning ability (due to using clean prompts)
  2. Robustness Evaluation: Need to develop evaluation protocols specifically for testing models' robustness to prompt changes
  3. Causal Mechanism Exploration: Deeply study the causes and internal changes of models being misled by prompts
7

Section 07

Reproduction Path and Conclusion

Reproduction Steps

  1. Open reasoning_experiment.ipynb in Google Colab
  2. Set the runtime to T4 GPU
  3. Run all cells in order
  4. Manually evaluate each response
  5. Re-run the analysis cells to get statistical results

Conclusion

Although this study is small in scale, it reveals the robustness issue of LLMs' reasoning ability. The 60% performance drop reminds us that we need to fully consider the risk of prompt sensitivity before deploying LLMs to critical applications. Only by understanding the model's capabilities and limitations can we use this technology responsibly.