Zing Forum

Reading

New Benchmark for Mathematical Analysis Theorem Proving: MA-ProofBench Reveals Shortcomings of Large Models in Advanced Mathematical Reasoning

The first formal theorem proving benchmark for mathematical analysis, MA-ProofBench, has been released. It covers 200 theorems in measure theory, complex analysis, functional analysis, and other areas. Tests show that even GPT-5.5 only has a 5% pass rate on PhD-level questions, exposing the significant limitations of current large models in formal reasoning for advanced mathematics.

形式化数学定理证明数学分析基准测试大模型评估LeanMathlib高等数学
Published 2026-06-12 02:00Recent activity 2026-06-15 10:51Estimated read 6 min
New Benchmark for Mathematical Analysis Theorem Proving: MA-ProofBench Reveals Shortcomings of Large Models in Advanced Mathematical Reasoning
1

Section 01

MA-ProofBench Benchmark Released: Revealing Shortcomings of Large Models in Formal Reasoning for Advanced Mathematics

The first formal theorem proving benchmark specifically for mathematical analysis, MA-ProofBench, has been officially released. It covers 200 theorems in measure theory, complex analysis, functional analysis, and other areas (100 undergraduate-level questions and 100 PhD qualifying exam-level questions). Tests show that even the top-performing GPT-5.5 only has a 5% pass rate on PhD-level questions, exposing the significant limitations of current large models in formal reasoning for advanced mathematics. This benchmark fills the gap in existing evaluations in the field of mathematical analysis and provides an important reference for assessing AI's mathematical capabilities.

2

Section 02

Background: Existing Limitations of Formal Mathematical Reasoning

While large language models have made progress in automated theorem proving, existing benchmarks have two major limitations: 1. Narrow coverage of mathematics, mostly focusing on easily formalizable areas like algebra and elementary number theory; 2. Insufficient difficulty levels, with limited coverage of branches requiring deep reasoning such as mathematical analysis. As the cornerstone of modern mathematics, formal reasoning in mathematical analysis demands higher standards from AI, and existing benchmarks cannot accurately evaluate the true capabilities of large models in advanced mathematics.

3

Section 03

Details of the MA-ProofBench Benchmark

MA-ProofBench is the first formal theorem proving benchmark for mathematical analysis, constructed under human leadership with LLM assistance and reviewed by experts. Content composition: 200 formal theorems covering 6 core topics (measure theory, complex analysis, etc.) and 27 sub-directions; difficulty levels are divided into Level I (undergraduate-level, including real number limits, basics of differential and integral calculus, etc.) and Level II (PhD-level, including abstract measure, Banach space theory, etc.), each with 100 questions.

4

Section 04

Evaluation Results: Poor Performance of Large Models and Typical Failure Modes

Test results of multiple models show: GPT-5.5 (the best performer) has a 16% pass rate on Level I and only 5% on Level II, while most other models are close to 0%. Key failure modes: 1. Mathlib hallucinations (fictional lemmas, incorrect applications, type mismatches); 2. Incomplete proofs (skipping key steps, circular reasoning, missing boundary cases). These issues are particularly fatal in formal systems.

5

Section 05

Qualitative Gap Between Informal and Formal Reasoning

Comparative experiments found that models can provide seemingly reasonable arguments in the informal (natural language) version, but cannot convert them into verifiable proofs in the formal version. This indicates that large models may grasp the surface patterns of mathematical concepts but lack deep formal reasoning capabilities, and there is a qualitative gap between the two.

6

Section 06

Implications and Future Research Directions

The essence of current large model limitations: good at symbolic manipulation but lack semantic understanding, conflict between probabilistic nature and rigor, difficulty in long-range dependency reasoning. Future directions: 1. Technical improvements (Mathlib-aware training, fine-tuning for formal rigor, enhanced tool interaction, long-range reasoning architectures); 2. Evaluation ecosystem construction (more benchmarks for mathematical branches, progressive difficulty systems, comparison with human capabilities).

7

Section 07

Conclusion: The Milestone Significance of MA-ProofBench

MA-ProofBench is an important milestone in evaluating AI's mathematical capabilities, revealing that large models are far from reaching the level of human mathematicians. The 5% pass rate on PhD-level questions reminds us that deep mathematical reasoning remains an unbreached fortress. This benchmark provides researchers with improvement goals and also offers a realistic reference for developers and decision-makers—undergraduate-level problems have some usability, but advanced mathematics fields need to be treated with caution.