Reading

Can Large Reasoning Models Identify False Presuppositions? An Empirical Study on Hypothetical Queries

This study systematically evaluates the ability of Large Reasoning Models (LRMs) to handle queries containing false presuppositions. The results show that although LRMs have a 2-11% higher accuracy rate than non-reasoning models, 26-42% of false presuppositions remain unchallenged, and the models are sensitive to the strength of presupposition expressions.

大推理模型预设识别错误假设批判性思维AI安全查询理解推理能力信息验证

Published 2026-05-05 02:15Recent activity 2026-05-06 10:28Estimated read 5 min

Can Large Reasoning Models Identify False Presuppositions? An Empirical Study on Hypothetical Queries

Section 01

[Introduction] Evaluation Study on the Ability of Large Reasoning Models to Identify False Presuppositions

This study systematically evaluates the ability of Large Reasoning Models (LRMs) to handle queries containing false presuppositions. The results show that compared to non-reasoning models, LRMs have a 2-11% higher accuracy rate, but 26-42% of false presuppositions remain unchallenged, and the models are sensitive to the strength of presupposition expressions. This study has important implications for AI system design and user usage.

Section 02

Background: The Raising of the False Presupposition Problem and Limitations of Existing Research

User queries often contain false presuppositions; if AI answers without discrimination, it will reinforce wrong perceptions. Early Large Language Models (LLMs) could not effectively identify false presuppositions, due to reasons such as training data mostly based on correct premises and interaction design tending to direct answers. The new generation of LRMs theoretically has better recognition ability, but empirical verification is needed.

Section 03

Research Methods: Building an Evaluation Benchmark for Presupposition Queries

The study constructed a multi-domain (health, science, common sense) test set, covering presuppositions of different strengths (strong assertions/weak implications). The evaluation criteria are: identifying false presuppositions, pointing out inconsistencies with facts, providing correct information, and responding politely.

Section 04

Key Findings: Progress and Limitations of Reasoning Models

The accuracy rate of LRMs in identifying false presuppositions is 2-11% higher than that of non-reasoning models; 2. 26-42% of false presuppositions still remain unchallenged; 3. Models are sensitive to the strength of presuppositions: strong assertions are easily accepted, while weak rumors are easily verified.

Section 05

In-depth Analysis: Reasons for the Failure of Reasoning Models

Limitations of reasoning chains: mostly forward reasoning rather than questioning premises; 2. Training data bias: most Q&A assumes correct premises; 3. Trade-off between safety and usefulness: avoiding adversarial responses leads to accepting false premises.

Section 06

Improvement Directions and Recommendations

Presupposition recognition training: introduce adversarial training with samples containing false presuppositions; 2. Reasoning guidance: require premise checking through system prompts; 3. Multi-turn interaction: confirm first when possible false presuppositions are detected; 4. Domain-specific mechanisms: automatically check common rumor presuppositions in high-risk domains (e.g., health).

Section 07

Conclusions and Implications

Although LRMs have made progress, their performance in handling false presuppositions is still not ideal. Designers need to focus on the critical thinking ability of models, and users need to maintain a critical attitude and cross-verify when obtaining information. In the future, AI systems that balance usefulness and error correction need to be designed.

Can Large Reasoning Models Identify False Presuppositions? An Empirical Study on Hypothetical Queries

[Introduction] Evaluation Study on the Ability of Large Reasoning Models to Identify False Presuppositions

Background: The Raising of the False Presupposition Problem and Limitations of Existing Research

Research Methods: Building an Evaluation Benchmark for Presupposition Queries

Key Findings: Progress and Limitations of Reasoning Models

In-depth Analysis: Reasons for the Failure of Reasoning Models

Improvement Directions and Recommendations

Conclusions and Implications

Continue Reading

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Folkering OS: When the Operating System Itself Is AI—A Self-Evolving Bare-Metal Rust System

LLM-assisted-analysis: A New Approach to Detecting Logical Vulnerabilities in Smart Contracts Using Large Language Models

Building Modern LLM from Scratch: A Tutorial-level Implementation of Llama-style Language Model