# StepSTEM: Revealing the True STEM Reasoning Capabilities of Multimodal Large Language Models

> StepSTEM uses 283 rigorously selected graduate-level interdisciplinary questions, enforces the complementarity of text and visual inputs, and introduces a step-level evaluation framework to reveal that current MLLMs achieve only 38.29% accuracy in true cross-modal reasoning.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-21T17:17:37.000Z
- Last activity: 2026-04-22T04:19:33.001Z
- Heat: 138.0
- Keywords: multimodal reasoning, STEM, benchmarking, step-level evaluation, cross-modal understanding, MLLMs, model evaluation
- Page URL: https://www.zingnex.cn/en/forum/thread/stepstem-stem
- Canonical: https://www.zingnex.cn/forum/thread/stepstem-stem
- Markdown source: floors_fallback

---

## Introduction

StepSTEM is a benchmark developed by teams from the University of California, Berkeley and Stanford University. It uses 283 rigorously selected graduate-level interdisciplinary questions (covering mathematics, physics, chemistry, biology, and engineering), enforces the complementarity of text and visual inputs, and introduces a step-level evaluation framework to reveal the true cross-modal reasoning capabilities of Multimodal Large Language Models (MLLMs). Test results show that even top-tier MLLMs (such as Gemini 3.1 Pro and Claude Opus 4.6) only achieve an accuracy rate of 38.29% on this benchmark, reflecting that current models still have significant shortcomings in true cross-modal reasoning.

## Background: Two Major Blind Spots in Existing Multimodal Reasoning Evaluations

Current MLLMs perform well on a wide range of tasks, but existing evaluations in the STEM field have two serious blind spots:

1. **Modal redundancy trap**: many questions can be solved from text or images alone (a single modality), with no true cross-modal understanding required.
2. **Result-oriented bias**: evaluation checks only whether the final answer is correct and ignores the quality of the reasoning process.

As a result, models can "cheat" their way to high scores, misleading judgments about their true capabilities.

## Methodology: Core Design Principles of StepSTEM

The design of StepSTEM revolves around three core principles:

1. **Strict modal complementarity**: every question requires combining text and images to solve; it cannot be completed correctly from a single modality.
2. **Graduate-level difficulty**: questions are drawn from university coursework, graduate exams, and professional certifications across five disciplines.
3. **Dynamic alignment against multiple reference solutions**: each question is paired with several manually verified reference solutions, and at evaluation time a dynamic programming algorithm aligns the model's reasoning steps against them, computing a step-matching score instead of relying on simple string matching.
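The dynamic-programming alignment described in principle 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the similarity function (here `difflib.SequenceMatcher`), the 0.6 matching threshold, and the normalization by reference length are all assumptions.

```python
from difflib import SequenceMatcher

def step_similarity(a: str, b: str) -> float:
    # Placeholder similarity; the paper's actual step matcher is unspecified.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def align_steps(model_steps, ref_steps, threshold=0.6):
    """Needleman-Wunsch-style DP: dp[i][j] is the best total similarity
    aligning the first i model steps to the first j reference steps."""
    m, n = len(model_steps), len(ref_steps)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sim = step_similarity(model_steps[i - 1], ref_steps[j - 1])
            # A pair only counts as a match above the threshold;
            # otherwise one side is skipped (a gap).
            match = dp[i - 1][j - 1] + (sim if sim >= threshold else 0.0)
            dp[i][j] = max(match, dp[i - 1][j], dp[i][j - 1])
    # Normalize by reference length: fraction of reference steps covered.
    return dp[m][n] / max(n, 1)

def best_reference_score(model_steps, references):
    # With multiple reference solutions, credit the best-matching one.
    return max(align_steps(model_steps, ref) for ref in references)
```

Because the DP permits gaps on either side, a model that skips a reference step or inserts an extra one is penalized only for the missing coverage, not for the misalignment itself.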

## Methodology: Innovations in the Step-Level Evaluation Framework

StepSTEM proposes a general step-level evaluation framework that supports two reasoning modes:

1. **Pure-text chain-of-thought evaluation**: the reasoning text is split into logical steps, each marked correct, partially correct, or incorrect, revealing the weak links in a solution.
2. **Interleaved image-text reasoning evaluation**: the framework checks the relevance of the image regions the model references, the effectiveness of any intermediate images it generates, and the consistency between text and visual content, quantifying for the first time MLLMs' ability to "think by looking at images" and "explain by drawing".
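The per-step labels from mode 1 could feed a simple scorer along these lines. The `Verdict`/`Step` names and the 1.0 / 0.5 / 0.0 credit weights are hypothetical, chosen only to make the sketch concrete; the paper does not specify its weighting.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    CORRECT = 1.0    # step is fully valid
    PARTIAL = 0.5    # step has the right idea but an error in execution
    INCORRECT = 0.0  # step is logically wrong

@dataclass
class Step:
    text: str
    verdict: Verdict

def step_level_score(steps):
    """Average credit across labeled reasoning steps (assumed weighting)."""
    if not steps:
        return 0.0
    return sum(s.verdict.value for s in steps) / len(steps)

def first_breakpoint(steps):
    """Index of the first incorrect step, i.e. where the reasoning
    chain breaks; None if the chain never breaks."""
    for i, s in enumerate(steps):
        if s.verdict is Verdict.INCORRECT:
            return i
    return None
```

Unlike answer-only accuracy, this score distinguishes a solution that fails at the last step from one that goes wrong immediately.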

## Evidence: Model Shortcomings Revealed by Experimental Results

Tests on mainstream models such as GPT-4V, Gemini 3.1 Pro, and Claude Opus 4.6 revealed:

1. **Overall performance**: even top-tier models reach only 38.29% accuracy.
2. **Cross-discipline differences**: performance is strongest in mathematics and weakest in biology and engineering.
3. **Reasoning-process issues**: insufficient visual grounding (a tendency to rely on text alone), frequent hallucinations (25% of wrong answers contain statements inconsistent with the images), and broken reasoning chains (an average of 2.3 logical breakpoints per wrong answer).
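Aggregate diagnostics such as the hallucination rate and average breakpoint count could be computed from per-question annotations along these lines. The record schema (`correct`, `hallucinated`, `breakpoints` fields) is invented here for illustration and is not described in the source.

```python
def diagnostics(records):
    """Summarize error-mode statistics over annotated evaluation records.

    Each record is assumed to carry:
      correct      -- bool, final answer matched the reference
      hallucinated -- bool, answer contained a statement inconsistent with the image
      breakpoints  -- int, number of logical breaks found in the reasoning chain
    Rates are computed over wrong answers only, matching how the
    reported 25% and 2.3 figures are framed.
    """
    wrong = [r for r in records if not r["correct"]]
    if not wrong:
        return {"hallucination_rate": 0.0, "avg_breakpoints": 0.0}
    return {
        "hallucination_rate": sum(r["hallucinated"] for r in wrong) / len(wrong),
        "avg_breakpoints": sum(r["breakpoints"] for r in wrong) / len(wrong),
    }
```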

## Implications: Guiding Significance for Multimodal AI Research

The results of StepSTEM carry three implications for the research community:

1. **Recalibrate capability expectations**: current MLLMs remain far from true cross-modal reasoning.
2. **Guide model improvements**: models need stronger visual grounding, better multimodal attention mechanisms, and reasoning-verification steps to reduce hallucinations.
3. **Upgrade evaluation standards**: future benchmarks should enforce modal complementarity, examine the reasoning process, and provide multi-dimensional feedback.

## Limitations and Future Work

StepSTEM has several limitations:

1. Small scale (283 questions) with limited coverage of subfields.
2. English-only questions.
3. A static dataset that cannot test interactive reasoning.

The team plans to expand the question bank through crowdsourcing and to explore interactive environments that test a model's reasoning when it is allowed to ask clarifying questions.
