Zing Forum

MathNet: The World's Largest Multilingual Mathematical Reasoning and Retrieval Benchmark Dataset Released

The MIT research team released the MathNet benchmark, which covers 30,676 Olympiad-level math problems from 47 countries in 17 languages. It is the first to systematically evaluate large models' mathematical retrieval capabilities and found that retrieval quality significantly impacts reasoning performance.

Mathematical Reasoning · Benchmarks · Multilingual Datasets · Retrieval Augmentation · Olympiad Math · LLM Evaluation
Published 2026-04-21 01:59 · Recent activity 2026-04-21 11:48 · Estimated read: 5 min

Section 01

MathNet Benchmark Dataset Released: The World's Largest Multilingual Mathematical Reasoning and Retrieval Evaluation Platform

The MIT research team released the MathNet benchmark dataset, which is the world's largest multilingual mathematical reasoning and retrieval benchmark. It covers 30,676 Olympiad-level math problems from 47 countries in 17 languages, and for the first time systematically evaluates large models' mathematical retrieval capabilities, finding that retrieval quality significantly impacts reasoning performance. The release of this benchmark marks a new stage in mathematical AI evaluation.


Section 02

Mathematical Reasoning: A Key Test of Large Models' Capabilities and Limitations of Existing Benchmarks

Mathematical problem-solving is the gold standard for testing large language models' reasoning abilities, requiring strict logic, symbolic operations, and coherent cross-step thinking. However, existing mathematical benchmarks have limitations in scale, language coverage, and task diversity, making it difficult to fully evaluate models' performance in real-world scenarios.


Section 03

MathNet Dataset: Balancing Scale and Quality

The MathNet dataset has an impressive scale, covering Olympiad-level math problems from 47 countries in 17 languages over a 20-year period, with a total of 30,676 expert-written problems and detailed solutions. Its diversity is reflected in covering fields such as algebra, geometry, number theory, and combinatorics, and each problem's solution provides a reference for model training and evaluation.
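The article does not specify MathNet's release format, so the following is a minimal sketch assuming a simple JSONL-style record layout; all field names (`language`, `field`, `country`, etc.) are illustrative, not the official schema.

```python
# Hypothetical sketch: slicing a MathNet-style corpus along the axes the
# benchmark emphasises (language coverage and mathematical field).
# The record schema below is assumed for illustration only.
from collections import Counter

problems = [
    {"id": "alg-0001", "language": "en", "field": "algebra",
     "country": "US", "year": 2019, "statement": "...", "solution": "..."},
    {"id": "geo-0042", "language": "fr", "field": "geometry",
     "country": "FR", "year": 2015, "statement": "...", "solution": "..."},
    {"id": "nt-0107", "language": "en", "field": "number_theory",
     "country": "GB", "year": 2021, "statement": "...", "solution": "..."},
]

# Distribution over mathematical fields (algebra, geometry, number theory, ...).
by_field = Counter(p["field"] for p in problems)

# Subset available in a given language.
english_only = [p for p in problems if p["language"] == "en"]

print(by_field)
print(len(english_only))
```

A real release would likely ship one such record per problem with the expert-written solution attached, which is what makes per-field and per-language evaluation breakdowns straightforward.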


Section 04

Three Core Tasks of MathNet: Comprehensive Evaluation of Mathematical Reasoning and Retrieval Capabilities

MathNet designs three core tasks:

  1. Problem Solving Task: tests end-to-end reasoning ability. The strongest frontier model, Gemini-3.1-Pro, reaches 78.4% accuracy, while GPT-5 reaches 69.3%;
  2. Math-Aware Retrieval Task: the first systematic evaluation of models' ability to retrieve mathematically equivalent and structurally similar problems, a setting where existing embedding models perform poorly;
  3. Retrieval-Augmented Problem Solving: measures how retrieval quality affects reasoning; DeepSeek-V3.2-Speciale gains 12% from high-quality retrieval.
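The retrieval-augmented setup (Task 3) can be sketched as a simple pipeline: embed the query problem, retrieve the most similar solved problems, and prepend them to the prompt. This is a minimal sketch using a toy bag-of-words embedding; MathNet's actual retrievers and any trained math-aware encoder are not described in the article, so everything below is an assumed stand-in.

```python
# Minimal retrieval-augmented sketch. The bag-of-words "embedding" is a toy
# stand-in for a learned math-aware encoder; the corpus is hypothetical.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus problems most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

corpus = [
    "prove that the sum of two odd integers is even",
    "find all primes p such that p + 2 is also prime",
    "show that the product of two even integers is divisible by 4",
]
query = "prove that the sum of two even integers is even"
context = retrieve(query, corpus)

# A retrieval-augmented prompt prepends the retrieved problems (and, in a
# real system, their solutions) before asking the model to solve the query.
prompt = "Similar solved problems:\n" + "\n".join(context) + f"\n\nSolve: {query}"
print(context[0])
```

The point of Task 2 is precisely that surface-level similarity like this fails on mathematically equivalent but lexically different problems, which is where the article says existing embedding models fall short.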

Section 05

Experimental Findings: Cutting-Edge Models Still Have Room for Improvement, Retrieval Augmentation Is Highly Valuable

Experimental results show that even the most advanced reasoning models still have room for improvement on Olympiad-level problems (with a maximum accuracy of 78.4%). Meanwhile, retrieval augmentation significantly impacts mathematical reasoning performance; DeepSeek-V3.2-Speciale achieved a 12% performance improvement through high-quality retrieval, proving the importance of external knowledge bases.
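The article does not say how accuracy is scored or whether the 12% gain is absolute or relative. The sketch below assumes exact-match scoring of final answers and computes a relative improvement; the numbers are made-up toy data, not MathNet results.

```python
# Hypothetical accuracy scoring, assuming exact-match on final answers
# (the article does not specify the metric). Data below is illustrative.
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)

answers   = ["42", "7", "12", "5"]
baseline  = accuracy(["42", "7", "13", "5"], answers)  # no retrieval
augmented = accuracy(["42", "7", "12", "5"], answers)  # with retrieved context

relative_gain = (augmented - baseline) / baseline
print(f"{baseline:.2f} -> {augmented:.2f} (+{relative_gain:.0%})")
```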


Section 06

Open Source Contribution: MathNet Empowers Mathematical AI Research and Applications

The MathNet team has open-sourced the dataset and benchmark tools (URL: https://mathnet.mit.edu), providing a fair and comprehensive evaluation platform for academia and industry. For researchers, it offers multilingual resources; for educators, it can serve as the content foundation for intelligent education systems; for model developers, fine-grained evaluation helps identify strengths and weaknesses.


Section 07

Future Outlook: Evolution Direction of Mathematical AI Evaluation Paradigms

The release of MathNet represents the evolution of mathematical AI evaluation paradigms, expanding from single problem-solving to comprehensive evaluation of retrieval capabilities and retrieval-augmented reasoning. In the future, combining multimodal large language models with high-quality datasets like MathNet is expected to achieve greater breakthroughs in the field of automatic mathematical reasoning.