# Semantic Correctness Evaluation of Automated Theorem Proving: From Compilation Success to Integration Testing

> The research team proposes a new evaluation framework for theorem proving that measures semantic correctness by checking whether the subsequent theorems that depend on a generated theorem still compile. Even state-of-the-art models reach only 38.9% accuracy under this strict semantic evaluation.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-26T13:24:20.000Z
- Last activity: 2026-04-28T02:05:46.321Z
- Popularity: 110.3
- Keywords: automated theorem proving, semantic evaluation, integration testing, Lean 4, formal verification, benchmarking
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-23698v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-23698v1

---

## Introduction: T-Test, a Semantic Evaluation Framework That Reveals Real Capability Gaps

The research team proposes the T-Test evaluation framework, which measures the semantic correctness of a generated theorem by checking whether the subsequent theorems that depend on it still compile. Under this strict semantic evaluation, state-of-the-art models reach only 38.9% accuracy. The framework borrows the idea of integration testing from software engineering and gives the field a more rigorous evaluation standard.

## Background: Evaluation Dilemma of Automated Theorem Proving

Existing evaluation methods each fall short. Lexical overlap compares only surface similarity and cannot reflect logical correctness; manual review is accurate but costly and hard to scale. This dilemma holds the field back: developers cannot accurately gauge model capabilities, and researchers cannot reliably compare methods.
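To make the lexical-overlap problem concrete, here is a minimal sketch. The two theorem statements and the `token_jaccard` helper are illustrative assumptions, not items from the benchmark; they only show that a one-token change can flip a true statement into a false one while the overlap score stays high.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical statements: the reference holds for natural numbers,
# the candidate does not (take a = 1, b = 0).
reference = "theorem add_pos (a b : Nat) (ha : 0 < a) : 0 < a + b"
candidate = "theorem add_pos (a b : Nat) (ha : 0 < a) : 0 < a * b"

score = token_jaccard(reference, candidate)
print(f"token overlap: {score:.3f}")  # high overlap despite a false statement
```

The statements differ in a single operator, so 11 of the 13 distinct tokens are shared, yet only one of them is provable.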

## Method: The T-Test Framework, a Semantic Evaluation Idea Inspired by Integration Testing

Inspired by test-driven evaluation in code generation, the team proposes the T-Test framework: a generated theorem counts as semantically correct if and only if all subsequent theorems that depend on it compile successfully. As with integration testing in software engineering, the emphasis is on whether the theorem supports the rest of the theory, not merely on whether it compiles in isolation.
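The criterion above can be sketched as a single predicate. This is a rough illustration under stated assumptions, not the authors' implementation: `t_test_passes`, `fake_compiles`, and the `REQUIRED` table are hypothetical names, and the compile check is a toy stand-in where a real harness would rebuild each dependent with the Lean 4 toolchain.

```python
from typing import Callable, Iterable


def t_test_passes(
    candidate: str,
    dependents: Iterable[str],
    compiles: Callable[[str, str], bool],
) -> bool:
    """True iff every dependent theorem compiles against the candidate.

    Note: with no dependents the result is vacuously True.
    """
    return all(compiles(candidate, dep) for dep in dependents)


# Toy stand-in for a compiler: a dependent "compiles" only if the
# statement it relies on appears in the candidate (assumption).
REQUIRED = {
    "lemma_one": "0 < a + b",  # needs the strict inequality
    "lemma_two": "a + b",      # only needs the expression itself
}


def fake_compiles(candidate: str, dependent: str) -> bool:
    return REQUIRED[dependent] in candidate


strong = "theorem add_pos : 0 < a + b"
weak = "theorem add_pos : 0 ≤ a + b"  # plausible locally, too weak downstream

strong_ok = t_test_passes(strong, REQUIRED, fake_compiles)
weak_ok = t_test_passes(weak, REQUIRED, fake_compiles)
print(strong_ok, weak_ok)
```

The weaker candidate satisfies one dependent but not the other, so it fails the T-Test even though it might pass a standalone compile check.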

## Evidence: Benchmark Dataset and Experimental Results

The team built a large-scale benchmark from 5 real Lean 4 code repositories. It contains 2,206 theorem problems, each with an average of 41 automatically extracted subsequent theorems. The experiments show that state-of-the-art models achieve high compilation success rates, but their performance drops significantly under T-Test evaluation: Claude-Sonnet-4.5 reaches only 38.9% accuracy even under ideal conditions. Providing context can improve generation quality.
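The gap between the two metrics can be illustrated with a small aggregation sketch. The `ProblemResult` record and the four toy results below are assumptions made for illustration; they are not data from the paper, and the sketch additionally assumes that a candidate must itself compile to count as a T-Test pass.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemResult:
    compiles: bool  # did the generated theorem compile on its own?
    dependents_ok: List[bool] = field(default_factory=list)  # one flag per dependent


def compile_rate(results: List[ProblemResult]) -> float:
    """Fraction of problems whose generated theorem compiles."""
    if not results:
        return 0.0
    return sum(r.compiles for r in results) / len(results)


def t_test_accuracy(results: List[ProblemResult]) -> float:
    """Fraction of problems that compile AND keep every dependent compiling."""
    if not results:
        return 0.0
    passed = sum(r.compiles and all(r.dependents_ok) for r in results)
    return passed / len(results)


# Toy results (hypothetical): three candidates compile, only one keeps
# all of its dependents compiling.
toy_results = [
    ProblemResult(True, [True, True, True]),
    ProblemResult(True, [True, False]),
    ProblemResult(True, [False]),
    ProblemResult(False, []),
]

print(compile_rate(toy_results))     # 0.75
print(t_test_accuracy(toy_results))  # 0.25
```

Because the T-Test pass set is a subset of the compiling set, T-Test accuracy can never exceed the compilation success rate under this definition.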

## Conclusion: Analysis of Key Gaps in Current Model Capabilities

The 38.9% accuracy rate points to four core weaknesses: insufficient formal rigor (logical gaps or mishandled boundary conditions), limited understanding of dependency relationships (global consistency is ignored), weak long-range reasoning, and insufficiently diverse training data (poor coverage of edge cases).

## Recommendations: Implications for Field Development

The results suggest four changes for the field: rethink evaluation standards and adopt semantic correctness as the measure; assess model capabilities objectively and avoid over-optimism; improve training strategies by focusing on semantic correctness and feeding T-Test results back into training; and refine human-machine collaboration, with AI generating candidates and humans verifying and correcting them.

## Limitations and Future Directions: Improvement Space of the T-Test Framework

The framework has three limitations: high computational cost, an assumption that the extracted dependencies are complete (key dependencies may be missed), and difficulty localizing errors. Future directions include developing efficient approximate evaluation methods, building comprehensive dependency analysis tools, and integrating the framework into model training for closed-loop optimization.
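As one conceivable shape for an approximate evaluation, the sketch below checks only a random sample of dependents and stops at the first failure. It is an assumption offered for illustration, not a method from the paper: `sampled_t_test` and its verdict labels are hypothetical, and the `compiles` callback stands in for a real Lean 4 build.

```python
import random
from typing import Callable, Sequence, Tuple


def sampled_t_test(
    dependents: Sequence[str],
    compiles: Callable[[str], bool],
    sample_size: int,
    seed: int = 0,
) -> Tuple[str, int]:
    """Check a random sample of dependents, stopping at the first failure.

    Returns (verdict, number_of_compile_checks). A "fail" is exact;
    "pass_on_sample" is only evidence, since unchecked dependents may fail.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(dependents))
    checked = 0
    for dep in rng.sample(list(dependents), k):
        checked += 1
        if not compiles(dep):
            return "fail", checked
    verdict = "pass" if k == len(dependents) else "pass_on_sample"
    return verdict, checked


# 41 hypothetical dependents, matching the benchmark's reported average.
deps = [f"dep_{i}" for i in range(41)]

all_good = sampled_t_test(deps, lambda d: True, sample_size=10)
all_bad = sampled_t_test(deps, lambda d: False, sample_size=10)
print(all_good, all_bad)
```

The trade-off is asymmetric: a sampled failure is a definite T-Test failure, while a sampled pass can still hide a broken dependent, so the sample size controls how much accuracy is traded for compile time.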