Section 01
Core Introduction: AutoMat Benchmark Reveals Limitations of Code Agents in Reproducing Computational Materials Science Discoveries
This article evaluates the reproducibility of LLM-based code agents in computational materials science using the AutoMat benchmark. The study finds that even the best-performing configuration succeeds on only 54.1% of tasks, with failures dominated by incomplete programs, methodological deviations, and brittle execution. AutoMat is designed around three core challenges: recovering underspecified programs, navigating specialized toolchains, and evaluating evidence. Its dataset is built from real research papers. The results provide a reality check for the AI-for-Science field, underscoring the importance of domain knowledge and human-machine collaboration.
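The headline number (a 54.1% success rate for the best configuration) implies a per-task scoring harness that tallies outcomes by failure category. A purely illustrative sketch of such a tally is below; the task records and category names are invented for demonstration and do not come from the AutoMat paper:

```python
from collections import Counter

# Hypothetical per-task results for one benchmark run. Category names
# loosely mirror the failure modes mentioned in the article (incomplete
# programs, method deviations, fragile execution); the data is invented.
results = [
    {"task": "t1", "status": "success"},
    {"task": "t2", "status": "incomplete_program"},
    {"task": "t3", "status": "success"},
    {"task": "t4", "status": "method_deviation"},
    {"task": "t5", "status": "execution_failure"},
]

def summarize(records):
    """Return the overall success rate and a tally of failure categories."""
    counts = Counter(r["status"] for r in records)
    total = len(records)
    rate = counts["success"] / total if total else 0.0
    failures = {k: v for k, v in counts.items() if k != "success"}
    return rate, failures

rate, failures = summarize(results)
print(f"success rate: {rate:.1%}")
print(failures)
```

With the invented records above, 2 of 5 tasks succeed, so the sketch reports a 40.0% success rate and one instance of each failure category.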