# Can Large Reasoning Models Identify False Presuppositions? An Empirical Study on Hypothetical Queries

> This study systematically evaluates the ability of Large Reasoning Models (LRMs) to handle queries containing false presuppositions. The results show that although LRMs have a 2-11% higher accuracy rate than non-reasoning models, 26-42% of false presuppositions remain unchallenged, and the models are sensitive to the strength of presupposition expressions.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-05-04T18:15:28.000Z
- 最近活动: 2026-05-06T02:28:10.744Z
- 热度: 118.8
- 关键词: 大推理模型, 预设识别, 错误假设, 批判性思维, AI安全, 查询理解, 推理能力, 信息验证
- 页面链接: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-03050v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-03050v1
- Markdown 来源: floors_fallback

---

## [Introduction] Evaluation Study on the Ability of Large Reasoning Models to Identify False Presuppositions

This study systematically evaluates the ability of Large Reasoning Models (LRMs) to handle queries containing false presuppositions. The results show that compared to non-reasoning models, LRMs have a 2-11% higher accuracy rate, but 26-42% of false presuppositions remain unchallenged, and the models are sensitive to the strength of presupposition expressions. This study has important implications for AI system design and user usage.

## Background: The Raising of the False Presupposition Problem and Limitations of Existing Research

User queries often contain false presuppositions; if AI answers without discrimination, it will reinforce wrong perceptions. Early Large Language Models (LLMs) could not effectively identify false presuppositions, due to reasons such as training data mostly based on correct premises and interaction design tending to direct answers. The new generation of LRMs theoretically has better recognition ability, but empirical verification is needed.

## Research Methods: Building an Evaluation Benchmark for Presupposition Queries

The study constructed a multi-domain (health, science, common sense) test set, covering presuppositions of different strengths (strong assertions/weak implications). The evaluation criteria are: identifying false presuppositions, pointing out inconsistencies with facts, providing correct information, and responding politely.

## Key Findings: Progress and Limitations of Reasoning Models

1. The accuracy rate of LRMs in identifying false presuppositions is 2-11% higher than that of non-reasoning models; 2. 26-42% of false presuppositions still remain unchallenged; 3. Models are sensitive to the strength of presuppositions: strong assertions are easily accepted, while weak rumors are easily verified.

## In-depth Analysis: Reasons for the Failure of Reasoning Models

1. Limitations of reasoning chains: mostly forward reasoning rather than questioning premises; 2. Training data bias: most Q&A assumes correct premises; 3. Trade-off between safety and usefulness: avoiding adversarial responses leads to accepting false premises.

## Improvement Directions and Recommendations

1. Presupposition recognition training: introduce adversarial training with samples containing false presuppositions; 2. Reasoning guidance: require premise checking through system prompts; 3. Multi-turn interaction: confirm first when possible false presuppositions are detected; 4. Domain-specific mechanisms: automatically check common rumor presuppositions in high-risk domains (e.g., health).

## Conclusions and Implications

Although LRMs have made progress, their performance in handling false presuppositions is still not ideal. Designers need to focus on the critical thinking ability of models, and users need to maintain a critical attitude and cross-verify when obtaining information. In the future, AI systems that balance usefulness and error correction need to be designed.
