Section 01
【Introduction】A New Semantic Evaluation Framework for Automated Theorem Proving: T-Test Reveals Real Capability Gaps
The research team proposes T-Test, an evaluation framework that measures semantic correctness by checking the compilation success rate of subsequent, dependent theorems. Under this strict semantic evaluation, state-of-the-art models reach an accuracy of only 38.9%. The framework borrows the idea of integration testing from software engineering and gives the field a more rigorous evaluation standard.
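To make the integration-testing analogy concrete, here is a minimal sketch of how a dependent-compilation success rate could be computed. It is an illustration under stated assumptions, not the authors' implementation: the names `dependent_success_rate`, `is_semantically_correct`, the `compiles` callback, the all-or-nothing threshold, and the toy lookup table are all hypothetical. In a real setup, `compiles` would call a proof assistant's build step.

```python
from typing import Callable, Dict, List, Tuple


def dependent_success_rate(
    candidate: str,
    dependents: List[str],
    compiles: Callable[[str, str], bool],
) -> float:
    """Fraction of dependent theorems that compile against `candidate`.

    `compiles(candidate, dependent)` is a hypothetical callback that returns
    True when the dependent theorem builds with the candidate in place.
    """
    if not dependents:
        # Assumption: with no dependents there is no evidence, so score 0.
        return 0.0
    passed = sum(1 for dep in dependents if compiles(candidate, dep))
    return passed / len(dependents)


def is_semantically_correct(rate: float, threshold: float = 1.0) -> bool:
    """Assumed decision rule: accept only if the rate meets the threshold."""
    return rate >= threshold


# Toy stand-in for a real proof-assistant build (illustrative data only).
TOY_RESULTS: Dict[Tuple[str, str], bool] = {
    ("candidate_v1", "dep_a"): True,
    ("candidate_v1", "dep_b"): True,
    ("candidate_v1", "dep_c"): False,
}


def toy_compiles(candidate: str, dependent: str) -> bool:
    # Unknown pairs are treated as build failures.
    return TOY_RESULTS.get((candidate, dependent), False)


rate = dependent_success_rate(
    "candidate_v1", ["dep_a", "dep_b", "dep_c"], toy_compiles
)
accepted = is_semantically_correct(rate)
```

In this toy run, two of the three dependents build, so the candidate is rejected under the strict threshold even though it might pass a check that only looks at the candidate in isolation. That is the same gap integration tests expose relative to unit tests.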