Zing Forum

Semantic Correctness Evaluation of Automated Theorem Proving: From Compilation Success to Integration Testing

The research team proposes a new evaluation framework for theorem proving that measures semantic correctness by checking whether the subsequent theorems that depend on a generated theorem still compile. Under this strict semantic evaluation, even state-of-the-art models reach an accuracy of only 38.9%.

Automated theorem proving, semantic evaluation, integration testing, Lean 4, formal verification, benchmarking
Published 2026-04-26 21:24 | Recent activity 2026-04-28 10:05 | Estimated read 5 min

Section 01

Introduction: A New Semantic Evaluation Framework for Automated Theorem Proving, T-Test Reveals Real Capability Gaps

The research team proposes the T-Test evaluation framework, which measures semantic correctness by checking whether the subsequent theorems that depend on a generated theorem still compile. Under this strict semantic evaluation, state-of-the-art models reach an accuracy of only 38.9%. The framework borrows the idea of integration testing from software engineering and gives the field a more rigorous evaluation standard.

Section 02

Background: Evaluation Dilemma of Automated Theorem Proving

Existing evaluation methods each have limitations. Lexical overlap compares only surface similarity and does not reflect logical correctness. Manual review is accurate but costly and hard to scale. This dilemma holds the field back: developers cannot gauge model capabilities accurately, and researchers cannot compare methods reliably.

Section 03

Method: The T-Test Framework, a Semantic Evaluation Idea Inspired by Integration Testing

Inspired by test-driven evaluation in code generation, the T-Test framework defines a generated theorem as semantically correct if and only if every subsequent theorem that depends on it still compiles successfully. As with integration testing in software engineering, the emphasis is on whether the theorem supports the rest of the theoretical system, not merely on whether it compiles in isolation.
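The criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `compiles` hook, the `TTestResult` type, and the theorem names are all hypothetical, and the extra check that the candidate itself compiles is an assumption added for completeness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class TTestResult:
    compiles_locally: bool              # candidate compiles on its own
    failed_dependents: Tuple[str, ...]  # dependents that no longer compile

    @property
    def semantically_correct(self) -> bool:
        # Correct iff the candidate compiles and no dependent breaks.
        return self.compiles_locally and not self.failed_dependents


def t_test(candidate: str,
           dependents: Sequence[str],
           compiles: Callable[[str, str], bool]) -> TTestResult:
    """Run the integration-style check for one generated theorem.

    `compiles(candidate, target)` is a hypothetical hook that answers:
    does `target` build when `candidate` replaces the original theorem?
    In practice it would invoke the Lean 4 toolchain on the repository.
    """
    if not compiles(candidate, candidate):
        return TTestResult(False, ())
    failed = tuple(d for d in dependents if not compiles(candidate, d))
    return TTestResult(True, failed)
```

Returning the list of failed dependents, instead of a bare boolean, keeps some information about where a candidate breaks the surrounding theory.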

Section 04

Evidence: Benchmark Dataset and Experimental Results

The team built a large-scale benchmark from 5 real Lean 4 code repositories. It contains 2206 theorem problems, with an average of 41 automatically extracted subsequent theorems per problem. The experiments show three things: state-of-the-art models achieve high compilation success rates, but their performance drops significantly under T-Test evaluation; even under ideal conditions, Claude-Sonnet-4.5 reaches only 38.9% accuracy; and providing context improves generation quality.
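To make the gap between the two metrics concrete, here is a small sketch that computes both from per-problem outcomes. The `summarize` helper and the record layout are assumptions for illustration, and the `toy` records are made up; they are not data from the benchmark.

```python
def summarize(results):
    """Return (compile_rate, t_test_accuracy) for a list of problems.

    Each record is a pair of booleans:
    (candidate compiles on its own, all dependent theorems still compile).
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    compile_rate = sum(1 for c, _ in results if c) / n
    # A problem counts under T-Test only if both conditions hold.
    t_test_accuracy = sum(1 for c, d in results if c and d) / n
    return compile_rate, t_test_accuracy


# Made-up illustrative records, not results from the paper.
toy = [(True, True), (True, False), (True, False), (False, False)]
compile_rate, t_test_accuracy = summarize(toy)
print(f"compile rate {compile_rate:.2f}, T-Test accuracy {t_test_accuracy:.2f}")
# prints: compile rate 0.75, T-Test accuracy 0.25
```

In this toy case three of four candidates compile, yet only one keeps all of its dependents compiling, which is the kind of drop the T-Test evaluation is designed to expose.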

Section 05

Conclusion: Analysis of Key Gaps in Current Model Capabilities

The 38.9% accuracy points to several core issues: insufficient formal rigor (logical loopholes or improper boundary handling), limited understanding of dependency relationships (global consistency is ignored), weak long-range reasoning, and training data that lacks diversity (boundary cases are poorly covered).

Section 06

Recommendations: Implications for Field Development

The results suggest four steps for the field: rethink evaluation standards and adopt semantic correctness evaluation; assess model capabilities objectively and avoid over-optimism; improve training strategies by focusing on semantic correctness and introducing T-Test feedback; and optimize human-machine collaboration, with AI generating candidates and humans verifying and correcting them.

Section 07

Limitations and Future Directions: Room for Improvement in the T-Test Framework

The framework has three limitations: high computational cost, an assumption that the extracted dependencies are complete (key dependencies may be missed), and difficulty localizing errors. Future directions include developing efficient approximate evaluation methods, building comprehensive dependency analysis tools, and integrating the framework into model training to achieve closed-loop optimization.
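One possible shape for an approximate evaluation, offered only as a sketch and not as a method from the paper, is to compile a random sample of the dependent theorems and stop at the first failure. The function name `approx_t_test`, the `compiles` hook, and the parameters `k` and `seed` are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def approx_t_test(dependents: Sequence[str],
                  compiles: Callable[[str], bool],
                  k: int,
                  seed: int = 0) -> Tuple[bool, Optional[str]]:
    """Check at most k randomly chosen dependents, stopping early.

    `compiles(name)` is a hypothetical hook that reports whether the
    dependent theorem `name` still builds with the candidate in place.
    Returns (passed, name of the first failing dependent or None).
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    sample = rng.sample(list(dependents), min(k, len(dependents)))
    for name in sample:
        if not compiles(name):
            return False, name  # early exit also hints at where it broke
    return True, None
```

The trade-off is one-sided: a reported failure is a real failure, but a pass may hide a broken dependent that was not sampled, so scores from such a check would be upper bounds on the full T-Test accuracy.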