Section 01
[Introduction] Evaluating LLM Architectural Reasoning Ability: Insights from the Obfuscated Natural Number Game
This paper evaluates the architectural reasoning ability of large language models (LLMs) in a zero-knowledge setting, using an obfuscated version of the Natural Number Game as the benchmark. The key finding is that reasoning models (e.g., DeepSeek-R1, GPT-5) maintain their accuracy once semantic cues are removed, whereas general-purpose models show performance degradation. The study aims to determine whether models rely on semantic pattern matching or on genuine logical reasoning.
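To make "removing semantic cues" concrete, the following Lean 4 sketch shows what an obfuscated Natural Number Game statement might look like. It is an illustration only: the names `Zorp`, `blik`, `quav`, and `frob` are hypothetical stand-ins for the natural numbers, zero, successor, and addition, and they are not taken from the benchmark itself. The theorem has the same logical structure as the familiar `zero_add`, but nothing in its name or statement hints at that.

```lean
-- Hypothetical obfuscated vocabulary (an assumption for illustration,
-- not the benchmark's actual renaming scheme):
--   Zorp ~ natural numbers, blik ~ zero, quav ~ successor, frob ~ addition
inductive Zorp where
  | blik : Zorp
  | quav : Zorp → Zorp

namespace Zorp

-- Addition by recursion on the second argument, under an opaque name.
def frob : Zorp → Zorp → Zorp
  | a, .blik => a
  | a, .quav b => .quav (frob a b)

-- Obfuscated counterpart of `zero_add : 0 + n = n`.
-- The proof is still induction on `n`, but the names give no hint of this.
theorem blik_frob (n : Zorp) : frob .blik n = n := by
  induction n with
  | blik => simp [frob]
  | quav m ih => simp [frob, ih]

end Zorp
```

A model that solves `zero_add` by recalling a memorized proof has nothing to match against here, while a model that reasons from the definitions can still find the induction.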