# New Benchmark for Mathematical Analysis Theorem Proving: MA-ProofBench Reveals Shortcomings of Large Models in Advanced Mathematical Reasoning

> The first formal theorem proving benchmark for mathematical analysis, MA-ProofBench, has been released. It covers 200 theorems in measure theory, complex analysis, functional analysis, and other areas. Tests show that even GPT-5.5 only has a 5% pass rate on PhD-level questions, exposing the significant limitations of current large models in formal reasoning for advanced mathematics.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-06-11T18:00:04.000Z
- 最近活动: 2026-06-15T02:51:49.693Z
- 热度: 79.0
- 关键词: 形式化数学, 定理证明, 数学分析, 基准测试, 大模型评估, Lean, Mathlib, 高等数学
- 页面链接: https://www.zingnex.cn/en/forum/thread/ma-proofbench
- Canonical: https://www.zingnex.cn/forum/thread/ma-proofbench
- Markdown 来源: floors_fallback

---

## MA-ProofBench Benchmark Released: Revealing Shortcomings of Large Models in Formal Reasoning for Advanced Mathematics

The first formal theorem proving benchmark specifically for mathematical analysis, MA-ProofBench, has been officially released. It covers 200 theorems in measure theory, complex analysis, functional analysis, and other areas (100 undergraduate-level questions and 100 PhD qualifying exam-level questions). Tests show that even the top-performing GPT-5.5 only has a 5% pass rate on PhD-level questions, exposing the significant limitations of current large models in formal reasoning for advanced mathematics. This benchmark fills the gap in existing evaluations in the field of mathematical analysis and provides an important reference for assessing AI's mathematical capabilities.

## Background: Existing Limitations of Formal Mathematical Reasoning

While large language models have made progress in automated theorem proving, existing benchmarks have two major limitations: 1. Narrow coverage of mathematics, mostly focusing on easily formalizable areas like algebra and elementary number theory; 2. Insufficient difficulty levels, with limited coverage of branches requiring deep reasoning such as mathematical analysis. As the cornerstone of modern mathematics, formal reasoning in mathematical analysis demands higher standards from AI, and existing benchmarks cannot accurately evaluate the true capabilities of large models in advanced mathematics.

## Details of the MA-ProofBench Benchmark

MA-ProofBench is the first formal theorem proving benchmark for mathematical analysis, constructed under human leadership with LLM assistance and reviewed by experts. Content composition: 200 formal theorems covering 6 core topics (measure theory, complex analysis, etc.) and 27 sub-directions; difficulty levels are divided into Level I (undergraduate-level, including real number limits, basics of differential and integral calculus, etc.) and Level II (PhD-level, including abstract measure, Banach space theory, etc.), each with 100 questions.

## Evaluation Results: Poor Performance of Large Models and Typical Failure Modes

Test results of multiple models show: GPT-5.5 (the best performer) has a 16% pass rate on Level I and only 5% on Level II, while most other models are close to 0%. Key failure modes: 1. Mathlib hallucinations (fictional lemmas, incorrect applications, type mismatches); 2. Incomplete proofs (skipping key steps, circular reasoning, missing boundary cases). These issues are particularly fatal in formal systems.

## Qualitative Gap Between Informal and Formal Reasoning

Comparative experiments found that models can provide seemingly reasonable arguments in the informal (natural language) version, but cannot convert them into verifiable proofs in the formal version. This indicates that large models may grasp the surface patterns of mathematical concepts but lack deep formal reasoning capabilities, and there is a qualitative gap between the two.

## Implications and Future Research Directions

The essence of current large model limitations: good at symbolic manipulation but lack semantic understanding, conflict between probabilistic nature and rigor, difficulty in long-range dependency reasoning. Future directions: 1. Technical improvements (Mathlib-aware training, fine-tuning for formal rigor, enhanced tool interaction, long-range reasoning architectures); 2. Evaluation ecosystem construction (more benchmarks for mathematical branches, progressive difficulty systems, comparison with human capabilities).

## Conclusion: The Milestone Significance of MA-ProofBench

MA-ProofBench is an important milestone in evaluating AI's mathematical capabilities, revealing that large models are far from reaching the level of human mathematicians. The 5% pass rate on PhD-level questions reminds us that deep mathematical reasoning remains an unbreached fortress. This benchmark provides researchers with improvement goals and also offers a realistic reference for developers and decision-makers—undergraduate-level problems have some usability, but advanced mathematics fields need to be treated with caution.