Evaluating the Architectural Reasoning Ability of Large Language Model Provers: Insights from the Obfuscated Natural Number Game

This paper evaluates the architectural reasoning ability of LLMs in a zero-knowledge environment, meaning one in which no identifier carries a familiar meaning, using the obfuscated Natural Number Game benchmark. The study finds that reasoning models (e.g., DeepSeek-R1, GPT-5) maintain accuracy once semantic cues are removed, while general-purpose models lose accuracy.

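Tags: architectural reasoning, formalized mathematics, theorem proving, Lean 4, obfuscation testing, DeepSeek-R1, GPT-5, automated theorem discovery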
Published 2026-05-01 22:03 | Recent activity 2026-05-04 10:18 | Estimated read 6 min

Section 01

[Introduction] Evaluating LLM Architectural Reasoning Ability: Insights from the Obfuscated Natural Number Game

This paper evaluates the architectural reasoning ability of large language models (LLMs) in a zero-knowledge environment using the obfuscated Natural Number Game benchmark. The key finding is that reasoning models (e.g., DeepSeek-R1, GPT-5) maintain accuracy after semantic cues are removed, while general-purpose models show performance degradation. The study aims to determine whether models rely on semantic pattern matching or on genuine logical reasoning.

Section 02

Problem Background: The Debate Over True Reasoning in Formalized Mathematics

LLMs have made significant progress on formalized mathematics benchmarks such as miniF2F, but it remains unclear whether that success stems from logical reasoning or from semantic pattern matching against familiar material. The researchers propose the concept of "architectural reasoning": the ability to synthesize proofs in unfamiliar mathematical domains using only locally provided axioms and definitions. They regard it as a core skill for AI systems that automate theorem discovery.

Section 03

Methodology: Zero-Knowledge Test Design Using the Obfuscated Natural Number Game

The test environment is built on the Lean 4 Natural Number Game. Renaming every identifier (types, functions, theorems, variables) produces a closed, zero-knowledge environment in which models cannot fall back on pre-trained mathematical knowledge and must reason from the local axioms alone. The design separates semantic pattern matching from architectural reasoning: a model whose performance degrades after obfuscation is leaning on semantic memory, while a model whose performance holds is showing evidence of architectural reasoning.
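To make the renaming concrete, here is a minimal Lean 4 sketch of one fragment before and after obfuscation. It is an illustration only: the names `MyNat`, `Qzx`, `vra`, `plo`, `wub`, and `t17` are invented here and are not taken from the paper's benchmark, and the fragment is expected to type-check in plain Lean 4 without Mathlib but has not been run against the paper's environment.

```lean
-- Illustrative only: a Peano-style fragment with meaningful names.
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

def MyNat.add : MyNat → MyNat → MyNat
  | a, MyNat.zero => a
  | a, MyNat.succ b => MyNat.succ (MyNat.add a b)

theorem MyNat.zero_add (n : MyNat) : MyNat.add MyNat.zero n = n := by
  induction n with
  | zero => rfl
  | succ k ih => simp [MyNat.add, ih]

-- The same fragment with every identifier renamed (hypothetical names).
-- The logical structure is unchanged; only the semantic cues are gone.
inductive Qzx where
  | vra : Qzx
  | plo : Qzx → Qzx

def Qzx.wub : Qzx → Qzx → Qzx
  | a, Qzx.vra => a
  | a, Qzx.plo b => Qzx.plo (Qzx.wub a b)

theorem Qzx.t17 (x1 : Qzx) : Qzx.wub Qzx.vra x1 = x1 := by
  induction x1 with
  | vra => rfl
  | plo x2 x3 => simp [Qzx.wub, x3]
```

The two proofs have the same shape, so a prover that works from the local definitions should handle both, whereas one that recognizes `zero_add` from memory has nothing to recognize in `t17`.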

Section 04

Experimental Results: Robustness Divergence Between Reasoning Models and General-Purpose Models

All models pay a "latency tax": reasoning time increases on the obfuscated problems. Beyond that, the two groups diverge. General-purpose models (Claude-Sonnet-4.5, GPT-4o) show significant performance degradation after obfuscation, which indicates reliance on semantic cues. Reasoning models (DeepSeek-R1, GPT-5, DeepSeek-Prover-V2) maintain stable accuracy, which indicates abstract reasoning that does not depend on domain-specific prior knowledge.
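The two quantities compared here, accuracy change and latency increase, can be computed along the following lines. This is a sketch under assumptions: the summary does not give the paper's exact metric definitions, so the function names and the choice of a simple accuracy difference and a ratio of mean solve times are hypothetical.

```python
from statistics import mean


def accuracy(outcomes):
    """Fraction of problems solved; `outcomes` is a list of booleans."""
    if not outcomes:
        raise ValueError("no outcomes given")
    return sum(outcomes) / len(outcomes)


def obfuscation_gap(original, obfuscated):
    """Accuracy on the original problems minus accuracy on the obfuscated ones.

    A gap near zero suggests robustness to renaming; a large positive
    gap suggests the model was relying on semantic cues.
    """
    return accuracy(original) - accuracy(obfuscated)


def latency_tax(original_seconds, obfuscated_seconds):
    """Ratio of mean solve time on obfuscated problems to mean time on originals."""
    return mean(obfuscated_seconds) / mean(original_seconds)
```

With per-problem pass/fail lists and timings for one model, `obfuscation_gap` and `latency_tax` give the two numbers this section discusses; any inputs used to try the functions are placeholders, not results from the paper.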

Section 05

Essence and Importance of Architectural Reasoning

Architectural reasoning has four elements: axiom understanding, strategy discovery, combinatorial exploration, and error recovery. It is therefore more than symbolic manipulation. An AI system that automates theorem discovery must explore unknown mathematical domains where pre-trained semantic knowledge does not apply, so it has to rely on architectural reasoning to build new theories.

Section 06

Implications for AI Ability Evaluation

Traditional benchmarks may overestimate a model's true ability: when benchmark data overlaps with pre-training data, a model can succeed from memory. Obfuscation tests offer a more direct measure of reasoning ability. For applications that require exploring new domains (scientific discovery, formal verification), reasoning models should be preferred even when their scores on conventional benchmarks are similar to those of general-purpose models.

Section 07

Technical Details: Lean 4 Environment and Obfuscation Strategy

Lean 4 is an interactive theorem prover and functional programming language, and the Natural Number Game is an interactive tutorial that develops natural-number arithmetic from the Peano axioms. The obfuscation strategy has four parts: type names are replaced with random strings, functions and operators receive meaningless identifiers, theorem names are anonymized, and variable names are replaced with uniform generated symbols. Together these ensure that models cannot draw on external knowledge.
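The renaming step can be sketched as a token-level substitution. This is a minimal sketch, not the paper's tool: it assumes the list of identifiers to rename is supplied by the caller, and the helper names `build_mapping` and `obfuscate` and the `x_` prefix are hypothetical. The sketch also ignores comments and string literals, which a real obfuscator for Lean sources would need to handle.

```python
import random
import re
import string

# Matches identifier-like tokens; dotted names such as `MyNat.add`
# are handled one component at a time.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_']*")


def build_mapping(identifiers, seed=0, length=6):
    """Map each identifier to a fresh meaningless name, reproducibly.

    The `x_` prefix (an assumption of this sketch) keeps generated
    names from colliding with language keywords.
    """
    rng = random.Random(seed)
    used = set(identifiers)
    mapping = {}
    for name in dict.fromkeys(identifiers):  # drop duplicates, keep order
        while True:
            letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            fresh = "x_" + letters
            if fresh not in used:
                break
        used.add(fresh)
        mapping[name] = fresh
    return mapping


def obfuscate(source, mapping):
    """Replace every whole token found in `mapping`; leave all other text intact."""
    return IDENT.sub(lambda m: mapping.get(m.group(0), m.group(0)), source)
```

Because substitution is applied to whole tokens, `zero` inside `zero_add` is not rewritten separately, and keywords such as `theorem` survive as long as they are not listed for renaming. A fixed seed makes the mapping reproducible, which matters if the same obfuscated problem set has to be shown to several models.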

Section 08

Future Research Directions

Three directions are proposed. First, extend obfuscation tests to code synthesis, logical puzzles, and scientific reasoning. Second, design better training strategies, such as training in diverse formalized environments to strengthen general reasoning. Third, promote human-machine collaborative theorem discovery, in which reasoning models assist humans in exploring new mathematical domains.