Section 01
Introduction: SoundnessBench Reveals Limitations in AI's Evaluation of Scientific Research Rigor
SoundnessBench is a benchmark for evaluating how well large language models (LLMs) judge the rigor of research methodologies. Its core finding is that current LLMs show a systematic optimistic bias: they tend to misjudge low-rigor research as rigorous. The implication is that autonomous AI-driven scientific research still requires human supervision, because LLMs cannot yet independently ensure the quality of research proposals.