Zing Forum

Can Code Agents Reproduce Discoveries in Computational Materials Science? Limitations Revealed by the AutoMat Benchmark

This article evaluates the reproducibility of LLM-based code agents in computational materials science using the AutoMat benchmark. The study finds that the success rate of the optimal configuration is only 54.1%, with key failure reasons including incomplete programs, method deviations, and execution fragility.

Tags: AutoMat, code agents, computational materials science, scientific reproduction, AI for Science, benchmarking, domain-specific tools, scientific workflows
Published 2026-05-02 01:42 · Recent activity 2026-05-04 10:24 · Estimated read 7 min
Section 01

Core Introduction: AutoMat Benchmark Reveals Limitations of Code Agents in Reproducing Computational Materials Science Discoveries

Using the AutoMat benchmark, the study finds that LLM-based code agents reproduce computational materials science results with at best a 54.1% success rate; the main failure modes are incomplete programs, method deviations, and execution fragility. AutoMat is designed around three core challenges: recovering unspecified programs, navigating professional toolchains, and evaluating evidence. Its dataset is built from real research papers. The results provide a reality check for the AI for Science field, underscoring the importance of domain knowledge and human-machine collaboration.

Section 02

Research Background: Essential Differences Between Computational Science Workflows and Software Engineering

LLM-based code agents perform well on software-engineering benchmarks, but whether that ability transfers to computational science workflows is questionable. The key differences: 1. computational science requires following complex, domain-specific experimental procedures; 2. results must be interpreted in the context of scientific claims; 3. proficiency with specialized scientific computing tools (e.g., VASP, LAMMPS) is required.
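The staged, parameter-laden nature of such workflows can be sketched as below. This is a minimal illustration, not a real simulation: the stage functions (`relax_structure`, `run_md`, `analyze`) are hypothetical stand-ins for DFT and MD tools such as VASP or LAMMPS, and the numbers are placeholders.

```python
# Hypothetical sketch of a computational-materials workflow: unlike a typical
# software-engineering task, each stage depends on domain choices (cutoffs,
# ensembles, convergence criteria) that a paper often leaves implicit.
from dataclasses import dataclass

@dataclass
class Structure:
    formula: str
    lattice_constant: float  # angstrom

def relax_structure(s: Structure, energy_cutoff_ev: float) -> Structure:
    """Stand-in for a DFT relaxation (e.g. with VASP); the cutoff is a domain choice."""
    # Placeholder: shrink the lattice slightly toward an 'equilibrium' value.
    return Structure(s.formula, round(s.lattice_constant * 0.99, 4))

def run_md(s: Structure, temperature_k: float, steps: int) -> list:
    """Stand-in for a molecular-dynamics run (e.g. with LAMMPS)."""
    # Placeholder trajectory; a real run needs force fields and an ensemble.
    return [s.lattice_constant + 0.001 * i for i in range(steps)]

def analyze(trajectory: list) -> float:
    """Average an observable over the trajectory, as a paper's figure might."""
    return sum(trajectory) / len(trajectory)

# The stages must run in order with consistent parameters throughout --
# exactly the implicit structure an agent must recover from paper text.
relaxed = relax_structure(Structure("Si", 5.43), energy_cutoff_ev=520.0)
result = analyze(run_md(relaxed, temperature_k=300.0, steps=5))
```

The point is that the interface between stages, not any single function, carries the domain knowledge: skipping the relaxation or changing the cutoff silently changes the final number.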

Section 03

AutoMat Benchmark Design: Three Core Challenges and Dataset Construction

The AutoMat benchmark includes three core challenges: 1. Inferring complete computational steps (algorithms, parameters, preprocessing, dependencies) from paper text; 2. Correctly selecting/configuring professional toolchains (first-principles calculations, molecular dynamics, data analysis tools); 3. Evaluating whether computational results support scientific claims (statistical significance, error sources, evidence differentiation). The dataset is collaboratively built by domain experts based on real materials science papers, including original text, chart data, and expert-validated gold-standard reproduction schemes.
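One way to picture a benchmark entry is as a record pairing paper text with expert-validated ground truth plus an evidence check. The field names and the tolerance-based check below are illustrative assumptions; the paper's actual schema is not reproduced here.

```python
# A hypothetical record layout for one AutoMat-style task; field names are
# illustrative assumptions, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class AutoMatTask:
    paper_excerpt: str             # the text the agent must work from
    figure_data: dict              # digitized chart data for checking
    gold_workflow: list            # expert-validated reproduction steps
    tolerance: float = 0.05       # relative error allowed vs. reported values

def claim_supported(reported: float, reproduced: float, tol: float) -> bool:
    """Evidence check: does the reproduced value support the paper's claim?"""
    return abs(reproduced - reported) <= tol * abs(reported)

task = AutoMatTask(
    paper_excerpt="A band gap of 1.12 eV computed with a hybrid functional...",
    figure_data={"band_gap_ev": [1.12]},
    gold_workflow=["build cell", "relax geometry", "compute band structure"],
)
ok = claim_supported(reported=1.12, reproduced=1.09, tol=task.tolerance)
```

A reproduced value of 1.09 eV falls within the assumed 5% tolerance of the reported 1.12 eV, so the claim counts as supported; 1.00 eV would not.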

Section 04

Experimental Results: 54.1% Success Rate and Key Failure Modes

The reproduction success rate of code agents under the optimal configuration is 54.1%, with nearly half of the attempts failing. Analysis of failure modes: ~40% stem from incomplete programs (missing preprocessing, ignoring parameter tuning, unrecognized implicit dependencies); 35% from method deviations (wrong algorithm/model selection, inconsistent parameters); 25% from execution fragility (tool call errors, environment configuration issues, numerical stability problems). The most challenging scenario is reconstructing workflows solely from paper text, due to missing implicit knowledge, difficulty in ambiguity resolution, and insufficient context.
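As a quick arithmetic check, the three reported failure shares account for essentially all failures. The counts below are invented solely to match the article's ~40/35/25 split.

```python
# Sanity-check the reported failure-mode breakdown; counts are invented
# to match the article's approximate percentages, not real data.
from collections import Counter

failures = Counter({
    "incomplete_program": 40,   # missing preprocessing, implicit dependencies
    "method_deviation": 35,     # wrong algorithm/model, inconsistent parameters
    "execution_fragility": 25,  # tool-call errors, environment, numerics
})
total = sum(failures.values())
shares = {mode: 100 * n / total for mode, n in failures.items()}
```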

Section 05

Comparative Analysis of Models and Agent Configurations

Performance of the base models: the GPT-4 series excels at code generation but lacks domain understanding; the Claude series has a clear long-context advantage but needs more accurate tool use; open-source models lag significantly behind proprietary ones, especially on complex reasoning tasks. Impact of agent configuration: ReAct-style agents are transparent but take many steps; separating planning from execution reduces intermediate errors; deep tool integration significantly improves success rates.
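The two agent styles can be contrasted in miniature. The model and tool calls below are stubs, and the function names (`react_loop`, `plan_then_execute`) are illustrative, not the paper's implementation.

```python
# Minimal sketch contrasting a ReAct-style loop with plan-then-execute.
# The "model" is a stub that returns canned actions.
def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call."""
    return "finish" if "observation" in prompt else "run_tool"

def react_loop(task: str, max_steps: int = 5) -> list:
    """Interleave thought/action/observation: transparent, but many steps."""
    trace, prompt = [], task
    for _ in range(max_steps):
        action = stub_model(prompt)
        trace.append(action)
        if action == "finish":
            break
        prompt += " observation: tool output"  # feed result back in
    return trace

def plan_then_execute(task: str) -> list:
    """Draft the full plan first, then run it: fewer intermediate errors."""
    plan = ["build_inputs", "run_simulation", "analyze"]
    return [f"executed:{step}" for step in plan]

react_trace = react_loop("reproduce figure 2")
plan_trace = plan_then_execute("reproduce figure 2")
```

The trade-off visible even in this toy: the ReAct trace grows with each observation and exposes every decision, while the planned trace is fixed up front and cannot drift mid-run.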

Section 06

Implications and Improvement Directions for AI for Science

Current limitations: Reports on AI-driven scientific discoveries are overly optimistic; pure code generation capabilities are insufficient for scientific tasks; human expert supervision remains indispensable. Improvement directions: 1. Enhance domain knowledge integration (specialized knowledge bases, domain templates, integration of physical/chemical constraints); 2. Improve tool usage capabilities (intelligent selection mechanisms, best practice libraries, error diagnosis and recovery); 3. Boost scientific reasoning (methodology understanding, statistical analysis, uncertainty quantification).
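The "error diagnosis and recovery" direction can be sketched as a wrapper that classifies a failure and retries with an adjusted parameter. The tool, its error message, and the diagnosis rule are hypothetical placeholders, not from the paper.

```python
# Sketch of error diagnosis and recovery around a fragile tool call.
# Both the tool and the recovery rule are hypothetical illustrations.
def fragile_tool(cutoff: float) -> float:
    """Stand-in for a scientific tool that fails on a too-low cutoff."""
    if cutoff < 400.0:
        raise ValueError("convergence failure: cutoff too low")
    return cutoff * 0.001  # dummy converged result

def run_with_recovery(cutoff: float, max_retries: int = 3) -> float:
    for _ in range(max_retries):
        try:
            return fragile_tool(cutoff)
        except ValueError as err:
            # Diagnose: a domain-aware rule maps the error to a parameter fix.
            if "cutoff too low" in str(err):
                cutoff += 100.0  # recovery step: raise cutoff and retry
            else:
                raise  # unknown failure: surface it instead of guessing
    raise RuntimeError("could not recover after retries")

result = run_with_recovery(300.0)  # fails once at 300, succeeds at 400
```

The key design point is that recovery is rule-driven and bounded: unrecognized errors propagate rather than being retried blindly, which is the opposite of the "execution fragility" failure mode described earlier.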

Section 07

Broader Impacts: Reproducibility, Education, and Policy Ethics

AutoMat is not only an AI benchmark but also a tool for evaluating the reproducibility of papers, revealing that many papers have incomplete computational descriptions. In education: Scientists need to be trained to write reproducible workflows, use AI tools appropriately, and verify automated outputs. In policy and ethics: Transparency of AI-assisted research must be ensured, result validation standards established, and efficiency and reliability balanced.