Zing Forum

NucBench: The First Multimodal Large Language Model Evaluation Benchmark for Nuclear Engineering

NucBench is the first open-source multimodal large language model (LLM) evaluation benchmark designed specifically for nuclear engineering. It comprises approximately 4,292 multiple-choice questions from the U.S. NRC Generic Fundamentals Examination (GFE) used in reactor operator licensing, more than 100 mixed-type questions from undergraduate nuclear engineering exams, and a two-phase flow regime image recognition dataset. Together these provide a standardized test of LLMs' domain knowledge and reasoning in a professional engineering field.

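NucBench, nuclear engineering, LLM evaluation, multimodal, benchmark, reactor, thermal-hydraulics, two-phase flow, GFE, nuclear power plant
Published 2026-05-11 18:54 | Recent activity 2026-05-11 19:03 | Estimated read 5 min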

Section 01

NucBench: Introduction to the First Multimodal LLM Evaluation Benchmark for Nuclear Engineering

Developed by a team at the University of Sharjah, NucBench is the first open-source multimodal LLM evaluation benchmark for nuclear engineering. It combines three components: approximately 4,292 multiple-choice questions from the reactor operator licensing exam (GFE), more than 100 mixed-type questions from undergraduate nuclear engineering exams, and a two-phase flow regime image recognition dataset. The goal is a standardized test of how well LLMs master and reason about nuclear engineering knowledge.

Section 02

Challenges of AI Applications in Nuclear Engineering and Limitations of Existing Benchmarks

Nuclear engineering is a highly specialized field with stringent safety requirements, built on complex bodies of knowledge such as reactor physics and thermal-hydraulics. General-purpose benchmarks such as MMLU and GSM8K do not cover professional engineering fields in depth. Nuclear engineering tasks also require models to carry out quantitative calculations and interpret visual information, and NucBench was created to test these abilities.

Section 03

Core Composition of the NucBench Evaluation Dataset

NucBench includes three types of tasks:

1. GFE exam: approximately 4,292 multiple-choice questions from the U.S. NRC, covering PWR and BWR reactor types.
2. Undergraduate nuclear engineering exams: more than 100 mixed-type questions covering 6 core subfields, including reactor thermal-hydraulics and reactor physics.
3. Two-phase flow regime image recognition: images from the Texas A&M University dataset, labeled with 4 flow regime categories such as bubbly flow and slug flow.
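As an illustration of how the multiple-choice portion could be scored, here is a minimal Python sketch that computes accuracy per subset. It is not NucBench's own scoring code: the `MCQuestion` record, the subset labels such as "GFE-PWR", the A to D option range, and the rule that a response must begin with its chosen letter are all assumptions made for this example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical record for one multiple-choice item (not the real schema)."""
    qid: str
    subset: str   # assumed label, e.g. "GFE-PWR" or "GFE-BWR"
    answer: str   # reference option letter, assumed to be one of A-D


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter a response starts with, or None.

    Assumes the model was asked to lead with its choice, e.g. "B" or "(c) ...".
    A response that opens with the English word "A" would be misread as option A.
    """
    match = re.match(r"\s*\(?([A-Da-d])\)?(?![A-Za-z])", response)
    return match.group(1).upper() if match else None


def accuracy_by_subset(
    questions: Iterable[MCQuestion], responses: Dict[str, str]
) -> Dict[str, float]:
    """Fraction of correctly answered questions per subset.

    A missing or unparseable response counts as incorrect.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        total[q.subset] += 1
        if extract_choice(responses.get(q.qid, "")) == q.answer:
            correct[q.subset] += 1
    return {subset: correct[subset] / total[subset] for subset in total}


if __name__ == "__main__":
    demo_questions = [
        MCQuestion("q1", "GFE-PWR", "B"),
        MCQuestion("q2", "GFE-PWR", "C"),
        MCQuestion("q3", "GFE-BWR", "A"),
    ]
    demo_responses = {"q1": "B", "q2": "(d) because ...", "q3": "a."}
    print(accuracy_by_subset(demo_questions, demo_responses))
```

Reporting accuracy per subset rather than as a single number keeps PWR and BWR results separable, which matters when one reactor type dominates the question pool.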

Section 04

Evaluation Objectives and Dimensions of NucBench

The objective is to evaluate multimodal LLMs in nuclear engineering along several dimensions: breadth of knowledge, depth of reasoning, multimodal understanding, adaptation to professional context, and numerical accuracy. The assessment spans basic physics through engineering practice.
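The numerical accuracy dimension could be checked in several ways; one simple option is a relative-tolerance comparison against a reference value. The sketch below assumes that the model's final answer is the last number in its response and that a 1% relative tolerance is acceptable. Both are illustrative choices, not rules taken from NucBench, and the function names are hypothetical.

```python
import math
import re
from typing import Optional

# Matches integers, decimals, and scientific notation such as 3.2e3.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last numeric literal found in text, or None if there is none.

    Assumption: the final answer is stated last. Digits embedded in symbols
    (for example the 2 in "UO2") would also be picked up, so real responses
    may need stricter answer formatting.
    """
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else None


def numerically_correct(response: str, reference: float, rel_tol: float = 0.01) -> bool:
    """True if the extracted answer is within rel_tol of the reference value."""
    value = last_number(response)
    return value is not None and math.isclose(value, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    print(numerically_correct("The pressure is 101.0 kPa", 100.0))
    print(numerically_correct("102", 100.0))
```

A relative tolerance suits engineering quantities that span many orders of magnitude, but it ignores units entirely; unit checking would need a separate step.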

Section 05

Engineering Significance and Application Prospects of NucBench

NucBench addresses a gap in LLM evaluation for professional engineering fields. It offers model developers a standardized testing platform, practitioners a way to evaluate the reliability of AI tools, educational institutions a benchmark for AI-assisted teaching, and safety assessors a preliminary screening mechanism. It can also serve as a reference for benchmark development in other engineering fields.

Section 06

Limitations and Future Directions of NucBench

Current limitations include a small question pool, a narrow range of question types (mainly multiple-choice), and limited field coverage (focused on reactor engineering). Future work could enlarge the question pool, add open-ended and automatically scored question types, extend coverage to areas such as the nuclear fuel cycle, and update the benchmark regularly.

Section 07

Dataset Structure and Usage Instructions of NucBench

The dataset is clearly organized: the code repository includes directories such as exams, images, and docs. It is released under the CC BY 4.0 license, which permits free use, modification, and redistribution with attribution, supporting collaboration and reproducibility in nuclear engineering AI research.
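Before running an evaluation, it can help to take a quick inventory of a local copy of the repository. The following Python sketch counts files by extension in the three directories named above. The file formats inside those directories are not described here, so the script makes no assumption about them; the `summarize_repo` function and the local path in the usage line are hypothetical.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Optional

# Directory names as mentioned in the article; the real repository may hold more.
EXPECTED_DIRS = ("exams", "images", "docs")


def summarize_repo(root: str) -> Dict[str, Optional[Counter]]:
    """Count files per extension under each expected directory.

    A directory that does not exist maps to None, so gaps are easy to spot.
    """
    base = Path(root)
    summary: Dict[str, Optional[Counter]] = {}
    for name in EXPECTED_DIRS:
        folder = base / name
        if not folder.is_dir():
            summary[name] = None
            continue
        summary[name] = Counter(
            path.suffix.lower() or "<none>"
            for path in folder.rglob("*")
            if path.is_file()
        )
    return summary


if __name__ == "__main__":
    # "./NucBench" is a placeholder for wherever the repository was cloned.
    for directory, counts in summarize_repo("./NucBench").items():
        print(directory, "missing" if counts is None else dict(counts))
```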