Zing Forum

SYNTEXIS: A New Benchmark for Automatic Formalization and Execution of Mathematical Reasoning in Large Models

SYNTEXIS is a benchmark framework for evaluating how well large language models (LLMs) can automatically formalize mathematical reasoning. Using chain-of-thought prompting, it tests whether a model can convert a natural-language mathematical problem into executable formal code.

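Tags: automatic formalization, mathematical reasoning, large language models, theorem proving, Lean, Coq, chain-of-thought, benchmark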
Published 2026-04-09 10:33 | Recent activity 2026-04-09 10:47 | Estimated read 6 min

Section 01

SYNTEXIS Benchmark: An Evaluation Framework for Automatic Formalization Capability of Mathematical Reasoning in Large Models

SYNTEXIS is a benchmark framework for evaluating how well large language models (LLMs) can automatically formalize mathematical reasoning. It uses chain-of-thought prompting to test whether a model can convert a natural-language mathematical problem into formal code that runs in a theorem prover such as Lean or Coq. It emphasizes end-to-end evaluation (checking that the code actually executes) and multi-dimensional metrics, giving the field a standardized measurement tool.

Section 02

Background: Challenges of Mathematical Formalization and Opportunities for LLMs

Formalizing mathematical reasoning has a high barrier to entry and typically requires years of specialized training. The rise of LLMs raises a core question: can AI automatically convert natural-language mathematical problems into rigorous formal code? This conversion is a key step in automated mathematical reasoning. The SYNTEXIS project was created to address this question by providing an evaluation benchmark.

Section 03

Core Design Principles of SYNTEXIS

SYNTEXIS follows three core design principles:

  1. End-to-end evaluation: Checks code syntax and also runs the code in a theorem prover and verifies the result, so that code that merely parses is not counted as a success;
  2. Chain-of-thought method: Encourages models to reason through the mathematics step by step before generating formal code, mirroring how a human solves the problem;
  3. Diverse mathematical domains: Covers branches such as algebra, geometry, and number theory, exposing differences in model performance across domains.
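
To make the first principle concrete, here is a small Lean 4 sketch of what a natural-language problem and its formalization might look like. The problems and theorem names are illustrative assumptions, not items from the SYNTEXIS dataset. The second theorem shows the kind of false success that a syntax-only check could miss.

```lean
-- Natural-language problem: "Show that adding zero to any natural number leaves it unchanged."
theorem add_zero_example (n : Nat) : n + 0 = n := by
  rfl

-- Natural-language problem: "Show that addition of natural numbers is commutative."
-- The statement is well formed, but `sorry` leaves the proof open,
-- so a check that only looks at syntax would wrongly count this as solved.
theorem add_comm_incomplete (a b : Nat) : a + b = b + a := by
  sorry

-- The same statement with a complete proof, using the core lemma `Nat.add_comm`.
theorem add_comm_complete (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

An end-to-end check would accept the first and third theorems and flag the second as incomplete.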

Section 04

Technical Implementation: Dataset, Execution Environment, and Evaluation Pipeline

SYNTEXIS's technical architecture includes:

  • Dataset construction: Collaboratively built by math experts and formalization experts, containing natural language descriptions, reference formalized code, and metadata (difficulty, domain, etc.);
  • Execution environment: Containerized deployment for dependency isolation, with timeout mechanisms and resource limits to ensure reproducibility;
  • Evaluation pipeline: Automatically receives input, extracts code, executes verification, collects results, and generates reports.
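
The pipeline steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SYNTEXIS implementation: the `BenchmarkItem` fields, the status labels, and the idea that the checker signals success with exit code 0 are all hypothetical. The checker command is passed in as a parameter, and a local subprocess with a timeout stands in for the containerized environment the benchmark describes.

```python
import re
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass
class BenchmarkItem:
    """Hypothetical dataset record; field names are assumptions."""
    statement: str        # natural-language problem description
    reference_code: str   # reference formalization
    domain: str           # metadata, e.g. "algebra"
    difficulty: str       # metadata, e.g. "easy"


# Build the fence marker programmatically so this listing nests cleanly.
FENCE_MARK = "`" * 3
FENCE = re.compile(FENCE_MARK + r"[^\n]*\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced block in a model response, or None.

    The last block is used because chain-of-thought text usually
    precedes the final formalization.
    """
    blocks = FENCE.findall(response)
    return blocks[-1].strip() if blocks else None


def run_checker(code: str, checker_cmd: Sequence[str],
                timeout_s: float = 60.0, suffix: str = ".lean") -> str:
    """Write the code to a temp file and run the checker on it with a timeout."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / ("candidate" + suffix)
        src.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run([*checker_cmd, str(src)],
                                  capture_output=True, text=True,
                                  timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return "timeout"
    # Assumption: the checker exits with 0 only when verification succeeds.
    return "verified" if proc.returncode == 0 else "rejected"


def evaluate(item: BenchmarkItem, response: str,
             checker_cmd: Sequence[str], timeout_s: float = 60.0) -> dict:
    """Extract code, verify it, and return one report row."""
    code = extract_code(response)
    status = "no_code" if code is None else run_checker(code, checker_cmd, timeout_s)
    return {"domain": item.domain, "difficulty": item.difficulty, "status": status}
```

With a prover installed, a call might look like `evaluate(item, response, ["lean"])`. A real harness would likely also need to reject placeholder proofs, since an exit code alone may not distinguish them from complete ones.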

Section 05

Multi-dimensional Evaluation: Comprehensive Measurement of Model Capabilities

SYNTEXIS provides three evaluation dimensions:

  1. Formalization success rate: Checks syntactic correctness, type correctness, and proof completeness of the generated code;
  2. Reasoning quality: Analyzes the soundness of individual steps, the appropriateness of the chosen strategies, and the ability to recover from errors within the chain of thought;
  3. Cross-language generalization: Evaluates how well the model generalizes across different theorem provers (e.g., Lean and Coq).
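
As a rough illustration of how the first and third dimensions might be aggregated, the sketch below computes a success rate per group from a list of result rows. The row layout (`domain`, `prover`, `status`) and the sample rows are invented for illustration. They are not SYNTEXIS data or results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def success_rates(records: Iterable[Mapping[str, str]], key: str) -> Dict[str, float]:
    """Fraction of records with status "verified", grouped by records[key]."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec["status"] == "verified":
            passed[group] += 1
    return {group: passed[group] / totals[group] for group in totals}


# Made-up rows for illustration only.
records = [
    {"domain": "algebra", "prover": "lean", "status": "verified"},
    {"domain": "algebra", "prover": "coq", "status": "rejected"},
    {"domain": "geometry", "prover": "lean", "status": "timeout"},
    {"domain": "number_theory", "prover": "lean", "status": "verified"},
    {"domain": "number_theory", "prover": "coq", "status": "verified"},
]

by_domain = success_rates(records, "domain")   # per-domain success rate
by_prover = success_rates(records, "prover")   # per-prover success rate
```

Comparing `by_prover` entries for the same problems gives one simple view of cross-prover generalization. Reasoning quality is harder to reduce to a single ratio and is not covered by this sketch.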

Section 06

Application Scenarios and Current Limitations

Application scenarios:

  • Model developers: Use the standardized benchmark to measure technical progress;
  • Formalization community: AI assistance to accelerate proof development;
  • Education system: Help students understand mathematical rigor.

Limitations:

  • Formalization diversity: The same concept can be formalized in several valid ways, so evaluation criteria must be flexible;
  • Prover differences: Theorem provers differ in syntax and library ecosystems, so each needs its own evaluation standards;
  • Computational resources: Executing formal code is resource-intensive, which limits how far large-scale evaluations can scale.
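
The first limitation can be seen in a small Lean 4 sketch. Both theorems below are reasonable formalizations of "4 is even", yet their statements differ textually, so comparing a model's output against a single reference string would reject one of them. The example and theorem names are illustrative assumptions, not taken from the benchmark.

```lean
-- Formalization A: evenness stated via the remainder.
theorem four_even_mod : 4 % 2 = 0 := rfl

-- Formalization B: evenness stated via an explicit witness.
theorem four_even_exists : ∃ k : Nat, 4 = 2 * k := ⟨2, rfl⟩
```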

Section 07

Future Outlook: Toward an Automatic Mathematician

SYNTEXIS is a milestone on the way to automated mathematical reasoning. Future directions include:

  1. Stronger model capabilities: Improve automatic formalization and have models actively discover proof strategies;
  2. Interactive formalization: Human-machine collaboration in which each side contributes proof steps in turn;
  3. Reverse conversion: Convert formal proofs into natural-language explanations to improve accessibility.