# Can Code Agents Reproduce Discoveries in Computational Materials Science? Limitations Revealed by the AutoMat Benchmark

> This article evaluates how reliably LLM-based code agents can reproduce discoveries in computational materials science using the AutoMat benchmark. The study finds that even the best agent configuration succeeds on only 54.1% of reproduction tasks, with the main failure modes being incomplete programs, method deviations, and execution fragility.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-01T17:42:12.000Z
- Last activity: 2026-05-04T02:24:36.416Z
- Popularity: 94.3
- Keywords: AutoMat, code agents, computational materials science, scientific reproduction, AI for Science, benchmarking, domain-specific tools, scientific workflows
- Page link: https://www.zingnex.cn/en/forum/thread/automat
- Canonical: https://www.zingnex.cn/forum/thread/automat
- Markdown source: floors_fallback

---

## Core Introduction: AutoMat Benchmark Reveals Limitations of Code Agents in Reproducing Computational Materials Science Discoveries

Using the AutoMat benchmark, this article evaluates how reliably LLM-based code agents can reproduce discoveries in computational materials science. The study finds that even the best agent configuration succeeds on only 54.1% of reproduction tasks, with the main failure modes being incomplete programs, method deviations, and execution fragility. The AutoMat benchmark is designed around three core challenges: recovering underspecified computational procedures, navigating professional toolchains, and evaluating whether results constitute evidence for a claim. Its dataset is built from real research papers. The results provide a reality check for the AI for Science field and underscore the importance of domain knowledge and human-machine collaboration.

## Research Background: Essential Differences Between Computational Science Workflows and Software Engineering

LLM-based code agents perform well on software engineering benchmarks, but it is questionable whether this ability transfers to computational science workflows. The key differences are: 1. Computational science requires following complex, domain-specific experimental procedures; 2. Results must be interpreted in the context of scientific claims; 3. Proficiency with specialized scientific computing tools (e.g., VASP, LAMMPS) is required.
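As a concrete illustration of such a domain-specific step, the sketch below sets up and relaxes a crystal structure. It uses the ASE library with its built-in EMT calculator purely for illustration; ASE and EMT are not mentioned in the article, and a real reproduction would typically drive VASP or LAMMPS with many more settings that the agent has to infer from the paper.

```python
# Minimal sketch of a domain-specific workflow step an agent must get right.
# ASE + EMT are illustrative stand-ins; not from the article.
from ase.build import bulk
from ase.calculators.emt import EMT
from ase.optimize import BFGS

atoms = bulk("Cu", "fcc", a=3.6)       # build the crystal structure
atoms.calc = EMT()                      # choose a calculator (a domain decision)
BFGS(atoms).run(fmax=0.01)              # relax until max force < 0.01 eV/Å
print(atoms.get_potential_energy())     # quantity later compared to the paper
```

Each of these choices (structure, calculator, convergence threshold) is exactly the kind of detail a paper often leaves implicit, which a generic code-generation benchmark never tests.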

## AutoMat Benchmark Design: Three Core Challenges and Dataset Construction

The AutoMat benchmark includes three core challenges: 1. Inferring complete computational steps (algorithms, parameters, preprocessing, dependencies) from paper text; 2. Correctly selecting/configuring professional toolchains (first-principles calculations, molecular dynamics, data analysis tools); 3. Evaluating whether computational results support scientific claims (statistical significance, error sources, evidence differentiation). The dataset is collaboratively built by domain experts based on real materials science papers, including original text, chart data, and expert-validated gold-standard reproduction schemes.
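The article does not specify how these tasks are stored, so the following is only a hypothetical sketch of what a single benchmark record could look like: paper text, digitized figure data, an expert-validated workflow, and a gold-standard value with a tolerance used to judge success. All field names and the tolerance are assumptions, not the actual AutoMat format.

```python
from dataclasses import dataclass

@dataclass
class AutoMatTask:
    """Hypothetical schema for one benchmark task (illustrative only)."""
    paper_excerpt: str         # relevant text from the source paper
    figure_data: dict          # digitized chart data referenced by the claim
    gold_workflow: list[str]   # expert-validated reproduction steps
    target_quantity: str       # e.g. "formation energy (eV/atom)"
    gold_value: float          # expert-reproduced reference value
    tolerance: float = 0.05    # relative tolerance for judging success

    def is_reproduced(self, predicted: float) -> bool:
        """Does an agent's result match the gold value within tolerance?"""
        return abs(predicted - self.gold_value) <= self.tolerance * abs(self.gold_value)
```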

## Experimental Results: 54.1% Success Rate and Key Failure Modes

The reproduction success rate of code agents under the optimal configuration is 54.1%, with nearly half of the attempts failing. Analysis of failure modes: ~40% stem from incomplete programs (missing preprocessing, ignoring parameter tuning, unrecognized implicit dependencies); 35% from method deviations (wrong algorithm/model selection, inconsistent parameters); 25% from execution fragility (tool call errors, environment configuration issues, numerical stability problems). The most challenging scenario is reconstructing workflows solely from paper text, due to missing implicit knowledge, difficulty in ambiguity resolution, and insufficient context.
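Restating the reported numbers as a small script makes the arithmetic explicit: the failure-mode shares apply to the roughly 45.9% of attempts that fail, not to all attempts. The 40/35/25 split and the 54.1% success rate are taken from the article; everything else below is just bookkeeping.

```python
# Failure-mode breakdown reported in the article (shares of failed attempts).
failure_modes = {
    "incomplete programs": 0.40,  # missing preprocessing, parameters, dependencies
    "method deviations":   0.35,  # wrong algorithm/model, inconsistent parameters
    "execution fragility": 0.25,  # tool-call errors, environment, numerical issues
}

success_rate = 0.541               # best agent configuration
failure_rate = 1.0 - success_rate  # ~45.9% of attempts fail

# Absolute share of all attempts lost to each failure mode.
for mode, share in failure_modes.items():
    print(f"{mode:>20}: {share * failure_rate:5.1%} of all attempts")

assert abs(sum(failure_modes.values()) - 1.0) < 1e-9  # shares cover all failures
```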

## Comparative Analysis of Models and Agent Configurations

Performance of different base models: the GPT-4 series excels at code generation but lacks domain understanding; the Claude series shows clear long-context advantages but needs more accurate tool use; open-source models lag significantly behind proprietary ones, especially on complex reasoning tasks. Impact of agent configurations: the ReAct style is transparent but involves long, complex step sequences; separating planning from execution reduces intermediate errors; deep tool integration significantly improves success rates.
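To make the configuration comparison concrete, here is a minimal, hypothetical sketch of the two control flows: a ReAct-style loop that interleaves reasoning and tool calls, and a plan-then-execute variant that commits to a full workflow first. The function names and signatures (`propose_step`, `make_plan`, `run_tool`) are placeholders, not an API described in the article.

```python
def react_agent(task, propose_step, run_tool, max_steps=20):
    """ReAct-style loop: interleave reasoning and tool use.
    The trace is transparent, but every extra step is another chance to fail."""
    history = [("task", task)]
    for _ in range(max_steps):
        thought, action = propose_step(history)   # LLM picks the next tool call
        history.append(("thought", thought))
        if action is None:                        # LLM signals it is done
            return history
        history.append(("observation", run_tool(action)))
    return history                                # step budget exhausted


def plan_then_execute_agent(task, make_plan, run_tool):
    """Plan-then-execute: commit to a full workflow up front, then run it.
    The article reports this reduces intermediate errors."""
    plan = make_plan(task)                        # LLM drafts the whole workflow
    return [run_tool(step) for step in plan]      # execute steps in order
```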

## Implications and Improvement Directions for AI for Science

Current limitations: Reports on AI-driven scientific discoveries are overly optimistic; pure code generation capabilities are insufficient for scientific tasks; human expert supervision remains indispensable. Improvement directions: 1. Enhance domain knowledge integration (specialized knowledge bases, domain templates, integration of physical/chemical constraints); 2. Improve tool usage capabilities (intelligent selection mechanisms, best practice libraries, error diagnosis and recovery); 3. Boost scientific reasoning (methodology understanding, statistical analysis, uncertainty quantification).
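As an example of what "integration of physical/chemical constraints" and "error diagnosis and recovery" could mean in practice, the sketch below validates a relaxation result against simple domain sanity checks before an agent reports it. The thresholds and field names are illustrative assumptions, not part of AutoMat.

```python
# Hypothetical physical-constraint checks applied to an agent's simulation output.
# All thresholds and dictionary keys are assumptions chosen for illustration.

def validate_relaxation(result: dict) -> list[str]:
    """Return the list of violated sanity checks for a structure relaxation,
    given {'energy_eV_per_atom', 'max_force_eV_A', 'min_bond_A'}."""
    problems = []
    if result["max_force_eV_A"] > 0.05:
        problems.append("forces not converged (> 0.05 eV/Å)")
    if result["min_bond_A"] < 0.8:
        problems.append("unphysically short bond (< 0.8 Å)")
    if not (-20.0 < result["energy_eV_per_atom"] < 0.0):
        problems.append("energy per atom outside a plausible range")
    return problems

# An agent could re-plan instead of reporting a result that fails these checks.
print(validate_relaxation({"energy_eV_per_atom": -3.7,
                           "max_force_eV_A": 0.02,
                           "min_bond_A": 2.5}))   # -> []
```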

## Broader Impacts: Reproducibility, Education, and Policy Ethics

AutoMat is not only an AI benchmark but also a tool for evaluating the reproducibility of papers, revealing that many papers have incomplete computational descriptions. In education: Scientists need to be trained to write reproducible workflows, use AI tools appropriately, and verify automated outputs. In policy and ethics: Transparency of AI-assisted research must be ensured, result validation standards established, and efficiency and reliability balanced.
