Section 01
StepSTEM: Revealing the True STEM Reasoning Capabilities of Multimodal Large Language Models (Introduction)
StepSTEM is a benchmark developed by teams from the University of California, Berkeley and Stanford University. It comprises 283 rigorously selected graduate-level interdisciplinary questions spanning mathematics, physics, chemistry, biology, and engineering. Each question is constructed so that the text and the visual input are complementary: neither modality alone suffices to answer it. On top of this, StepSTEM introduces a step-level evaluation framework that judges each intermediate reasoning step rather than only the final answer, with the aim of revealing the true cross-modal reasoning capabilities of Multimodal Large Language Models (MLLMs). Test results show that even top-tier MLLMs (such as Gemini 3.1 Pro and Claude Opus 4.6) reach only 38.29% accuracy on this benchmark, indicating that current models still fall well short of genuine cross-modal reasoning.
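The distinction between answer-level and step-level evaluation can be illustrated with a minimal sketch. The actual StepSTEM scoring protocol is not detailed here, so the record layout (`QuestionResult`) and both metrics below are assumptions for illustration only: a model can land on the right final answer while some of its intermediate steps are wrong, and the two metrics expose that gap.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    # Hypothetical record of one model's reasoning trace on one question.
    steps_correct: list          # per-step judgments (True = step is valid)
    final_answer_correct: bool   # whether the final answer matches the key

def answer_accuracy(results):
    """Fraction of questions whose final answer is correct."""
    return sum(r.final_answer_correct for r in results) / len(results)

def step_accuracy(results):
    """Fraction of individual reasoning steps judged correct, pooled
    across all questions."""
    total = sum(len(r.steps_correct) for r in results)
    correct = sum(sum(r.steps_correct) for r in results)
    return correct / total

if __name__ == "__main__":
    demo = [
        QuestionResult([True, True, False], False),  # chain breaks, answer wrong
        QuestionResult([True, False, True], True),   # right answer, flawed step
        QuestionResult([True, True, True], True),    # fully sound trace
    ]
    print(round(answer_accuracy(demo), 3))  # 0.667
    print(round(step_accuracy(demo), 3))    # 0.778
```

The second demo question is the interesting case: answer-level scoring credits it fully, while step-level scoring penalizes the flawed intermediate step, which is exactly the behavior a step-level framework is designed to surface.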