Section 01
StepSTEM: Revealing the True STEM Reasoning Capabilities of Multimodal Large Language Models (Introduction)
StepSTEM is a benchmark developed by teams from the University of California, Berkeley and Stanford University. It comprises 283 rigorously selected graduate-level interdisciplinary questions spanning mathematics, physics, chemistry, biology, and engineering. Each question is constructed so that the text and the visual input are complementary: neither modality alone suffices to answer it. On top of this, StepSTEM introduces a step-level evaluation framework that judges each intermediate reasoning step rather than only the final answer, with the aim of revealing the true cross-modal reasoning capabilities of Multimodal Large Language Models (MLLMs). Test results show that even top-tier MLLMs (such as Gemini 3.1 Pro and Claude Opus 4.6) reach only 38.29% accuracy on this benchmark, indicating that current models still fall well short of genuine cross-modal reasoning.
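The distinction between answer-level and step-level evaluation can be illustrated with a minimal sketch. The actual StepSTEM scoring protocol is not detailed here, so the record layout (`QuestionResult`) and both metrics below are assumptions for illustration only: a model can land on the right final answer while some of its intermediate steps are wrong, and the two metrics expose that gap.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    # Hypothetical record of one model's reasoning trace on one question.
    steps_correct: list          # per-step judgments (True = step is valid)
    final_answer_correct: bool   # whether the final answer matches the key

def answer_accuracy(results):
    """Fraction of questions whose final answer is correct."""
    return sum(r.final_answer_correct for r in results) / len(results)

def step_accuracy(results):
    """Fraction of individual reasoning steps judged correct, pooled
    across all questions."""
    total = sum(len(r.steps_correct) for r in results)
    correct = sum(sum(r.steps_correct) for r in results)
    return correct / total

if __name__ == "__main__":
    demo = [
        QuestionResult([True, True, False], False),  # chain breaks, answer wrong
        QuestionResult([True, False, True], True),   # right answer, flawed step
        QuestionResult([True, True, True], True),    # fully sound trace
    ]
    print(round(answer_accuracy(demo), 3))  # 0.667
    print(round(step_accuracy(demo), 3))    # 0.778
```

The second demo question is the interesting case: answer-level scoring credits it fully, while step-level scoring penalizes the flawed intermediate step, which is exactly the behavior a step-level framework is designed to surface.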