Zing Forum

Reading

Can Large Reasoning Models Identify False Presuppositions? An Empirical Study on Hypothetical Queries

This study systematically evaluates the ability of Large Reasoning Models (LRMs) to handle queries containing false presuppositions. The results show that although LRMs have a 2-11% higher accuracy rate than non-reasoning models, 26-42% of false presuppositions remain unchallenged, and the models are sensitive to the strength of presupposition expressions.

大推理模型预设识别错误假设批判性思维AI安全查询理解推理能力信息验证
Published 2026-05-05 02:15Recent activity 2026-05-06 10:28Estimated read 5 min
Can Large Reasoning Models Identify False Presuppositions? An Empirical Study on Hypothetical Queries
1

Section 01

[Introduction] Evaluation Study on the Ability of Large Reasoning Models to Identify False Presuppositions

This study systematically evaluates the ability of Large Reasoning Models (LRMs) to handle queries containing false presuppositions. The results show that compared to non-reasoning models, LRMs have a 2-11% higher accuracy rate, but 26-42% of false presuppositions remain unchallenged, and the models are sensitive to the strength of presupposition expressions. This study has important implications for AI system design and user usage.

2

Section 02

Background: The Raising of the False Presupposition Problem and Limitations of Existing Research

User queries often contain false presuppositions; if AI answers without discrimination, it will reinforce wrong perceptions. Early Large Language Models (LLMs) could not effectively identify false presuppositions, due to reasons such as training data mostly based on correct premises and interaction design tending to direct answers. The new generation of LRMs theoretically has better recognition ability, but empirical verification is needed.

3

Section 03

Research Methods: Building an Evaluation Benchmark for Presupposition Queries

The study constructed a multi-domain (health, science, common sense) test set, covering presuppositions of different strengths (strong assertions/weak implications). The evaluation criteria are: identifying false presuppositions, pointing out inconsistencies with facts, providing correct information, and responding politely.

4

Section 04

Key Findings: Progress and Limitations of Reasoning Models

  1. The accuracy rate of LRMs in identifying false presuppositions is 2-11% higher than that of non-reasoning models; 2. 26-42% of false presuppositions still remain unchallenged; 3. Models are sensitive to the strength of presuppositions: strong assertions are easily accepted, while weak rumors are easily verified.
5

Section 05

In-depth Analysis: Reasons for the Failure of Reasoning Models

  1. Limitations of reasoning chains: mostly forward reasoning rather than questioning premises; 2. Training data bias: most Q&A assumes correct premises; 3. Trade-off between safety and usefulness: avoiding adversarial responses leads to accepting false premises.
6

Section 06

Improvement Directions and Recommendations

  1. Presupposition recognition training: introduce adversarial training with samples containing false presuppositions; 2. Reasoning guidance: require premise checking through system prompts; 3. Multi-turn interaction: confirm first when possible false presuppositions are detected; 4. Domain-specific mechanisms: automatically check common rumor presuppositions in high-risk domains (e.g., health).
7

Section 07

Conclusions and Implications

Although LRMs have made progress, their performance in handling false presuppositions is still not ideal. Designers need to focus on the critical thinking ability of models, and users need to maintain a critical attitude and cross-verify when obtaining information. In the future, AI systems that balance usefulness and error correction need to be designed.