Evaluation of Quantitative Reasoning Ability of Large Language Models in Indoor Air Engineering: A Groundbreaking Benchmark Study

A research team from VinUniversity (Vietnam) and the University of Illinois (USA) has published a systematic evaluation of the quantitative reasoning ability of large language models in indoor air quality engineering, testing several mainstream models including GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro.

Tags: Large Language Models · Indoor Air Quality · Quantitative Reasoning · Benchmark · Environmental Engineering · AI Evaluation · GPT-4 · Claude · Gemini · Engineering Applications
Published 2026-04-01 03:11 · Recent activity 2026-04-01 03:17 · Estimated read: 7 min

Section 01

[Introduction] Benchmark Study on Quantitative Reasoning Ability of Large Language Models in Indoor Air Engineering

A research team from institutions including VinUniversity (Vietnam) and the University of Illinois (USA) conducted a systematic evaluation of the quantitative reasoning ability of Large Language Models (LLMs) in the field of Indoor Air Quality (IAQ) engineering. The study tested multiple mainstream models such as GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro, constructed a dataset of 480 professional questions, and compared general-purpose (NSD) prompts against IAQ domain-specific prompts. The results revealed significant performance differences across models and showed that domain knowledge markedly improves reasoning ability, providing a key reference for applying AI in environmental engineering.


Section 02

Research Background and Significance: Filling the Gap in LLM Quantitative Reasoning Evaluation in Professional Engineering Fields

As AI technology develops, LLMs have performed impressively across many fields, but research on their quantitative reasoning ability in specialized engineering domains remains insufficient. IAQ engineering spans multiple disciplines, including building environment and fluid mechanics, and places high demands on a model's professional knowledge and computational ability. This study, jointly conducted by scholars from multiple institutions, fills this gap and provides a reference for applying AI in environmental engineering.


Section 03

Research Methods: Dataset Construction, Model Selection, and Prompt Strategy

Dataset Construction

Carefully constructed 480 quantitative reasoning questions covering core IAQ fields such as ventilation design, pollutant diffusion, and air purification efficiency.
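
To give a concrete sense of the style of question such a benchmark targets (this example is illustrative, not taken from the paper's dataset), a classic IAQ calculation is the steady-state contaminant concentration from a well-mixed mass balance, C_ss = C_out + G/Q:

```python
# Illustrative IAQ calculation (not drawn from the paper's dataset):
# steady-state contaminant concentration in a well-mixed room,
# C_ss = C_out + G / Q.

def steady_state_concentration(c_out_mg_m3: float,
                               emission_mg_h: float,
                               ventilation_m3_h: float) -> float:
    """Steady-state concentration (mg/m^3) from a well-mixed mass balance."""
    return c_out_mg_m3 + emission_mg_h / ventilation_m3_h

# Example: 0.05 mg/m^3 outdoors, 500 mg/h indoor source, 360 m^3/h ventilation.
print(round(steady_state_concentration(0.05, 500.0, 360.0), 2))  # 1.44 mg/m^3
```

Getting such questions right requires exactly the skills the benchmark probes: choosing the correct formula, keeping units consistent, and carrying the arithmetic through multiple steps.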

Model Selection

Tested mainstream models: OpenAI (GPT-4.1), Anthropic (Claude 3.7 Sonnet), Google (Gemini 2.5 Pro), Baidu Wenxin (ERNIE-4.5-300B-A47B), Meta (Llama 4 Scout), Mistral AI (Mistral Large 2), DeepSeek (DeepSeek-R1-0528), xAI (Grok 3).

Prompt Strategy

Compared two prompting conditions: (1) NSD prompts, a standard general-purpose baseline; and (2) IAQ prompts, domain-specific prompts tailored to IAQ; the team then analyzed how domain knowledge affects model performance.
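
As a hedged illustration of how the two conditions might be implemented (the paper's exact templates are not reproduced here, so the wording below is an assumption), the sketch pairs a generic NSD template with an IAQ template that injects domain cues:

```python
# Hypothetical sketch of the two prompting conditions; the exact templates
# used in the study are assumptions here, not the paper's published text.

NSD_PROMPT = (
    "Solve the following engineering problem step by step. "
    "Show all calculations and state the final answer with units.\n\n"
    "{question}"
)

IAQ_PROMPT = (
    "You are an indoor air quality (IAQ) engineer. Apply standard IAQ "
    "principles such as well-mixed mass balances, air change rates, and "
    "filter efficiency relations, and check unit consistency at every "
    "step.\n\nSolve the following problem step by step and state the final "
    "answer with units.\n\n{question}"
)

def build_prompt(question: str, domain_specific: bool) -> str:
    """Select the IAQ or NSD template for a given benchmark question."""
    template = IAQ_PROMPT if domain_specific else NSD_PROMPT
    return template.format(question=question)
```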


Section 04

Key Findings: Differences in Model Performance and the Importance of Domain Knowledge

  1. Differences in Model Performance: Different LLMs show significant differences in IAQ quantitative reasoning ability, reflected in dimensions such as answer accuracy, logical rigor, formula application, and unit conversion.
  2. Value of Domain Knowledge: Accuracy and problem-solving quality improved significantly under IAQ-specific prompts, demonstrating the importance of domain knowledge.
  3. Analysis of Failure Cases: Models showed limitations such as misapplying complex formulas, breaking the logical chain in multi-step reasoning, and misunderstanding technical terms.

Section 05

Practical Application Value: Implications for Engineering Education, Industrial Applications, and Future Research

Engineering Education

Helps educators design courses, make sound use of AI-assisted teaching, and cultivate students' independent thinking.

Industrial Applications

Guides practitioners to understand the applicable boundaries of AI; LLMs can be used as auxiliary tools, but key decisions still need to be verified by human experts.

Future Research

The methodological framework can be extended to other engineering fields, and the identified model limitations point the way for future improvements.


Section 06

Technical Details: Reproducible Open-Source Architecture and Experimental Execution Process

Open-Source Code Architecture

Adopted OOP design; core components include data loading (CSV), model interface (OpenRouter API), inference execution (batch + repeated experiments), and result storage (Markdown).
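
A minimal sketch of such an architecture, assuming hypothetical class and method names rather than the repository's actual API, might look like this:

```python
# Minimal sketch of the described pipeline; class and method names are
# assumptions, not the repository's actual API.
import csv

class BenchmarkRunner:
    """Data loading -> model interface -> batch inference -> Markdown output."""

    def __init__(self, dataset_path: str, model_name: str, repetitions: int = 5):
        self.dataset_path = dataset_path
        self.model_name = model_name
        self.repetitions = repetitions

    def load_questions(self) -> list:
        """Data-loading component: read the question set from CSV."""
        with open(self.dataset_path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

    def query_model(self, prompt: str) -> str:
        """Model-interface component: one call through the OpenRouter API
        (see the API sketch in the next subsection)."""
        raise NotImplementedError

    def run(self, output_path: str) -> None:
        """Inference execution: batch over questions with repeated trials,
        storing results as Markdown."""
        questions = self.load_questions()
        with open(output_path, "w", encoding="utf-8") as out:
            for q in questions:
                for rep in range(self.repetitions):
                    answer = self.query_model(q["question"])
                    out.write(f"## {q['id']} (run {rep + 1})\n\n{answer}\n\n")
```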

Experimental Execution

Recommended the Google Colab Pro+ platform for its computing resources, convenient cloud storage, and cost-effectiveness. Process: configure API key → select model → set output path → launch automated testing (5 repetitions to ensure statistical reliability).
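
Assuming OpenRouter's OpenAI-compatible chat completions endpoint, a single model call could look like the sketch below (the model ID shown is illustrative, not necessarily the identifier used in the study):

```python
# A minimal OpenRouter call using its OpenAI-compatible chat endpoint.
# The model ID below is an assumption for illustration.
import os
import requests

def query_model(prompt: str, model: str = "openai/gpt-4.1") -> str:
    """Send one question to a model through the OpenRouter API."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```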

Environment Configuration

Requires OpenRouter API key, Google Drive space, CSV dataset template, and Python script running environment.
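
A hypothetical illustration of what the CSV dataset template could contain (the column names and sample row are assumptions, not the study's actual schema):

```python
# Hypothetical CSV dataset template; column names and the sample row are
# assumptions, not the study's actual schema.
import csv

rows = [
    {"id": "Q001",
     "topic": "ventilation design",
     "question": ("A 120 m^3 office is ventilated at 360 m^3/h. "
                  "What is the air change rate (ACH)?"),
     "reference_answer": "3.0 ACH"},
]

with open("iaq_questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["id", "topic", "question", "reference_answer"])
    writer.writeheader()
    writer.writerows(rows)
```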


Section 07

Summary and Outlook: Opportunities and Challenges of AI Application in Engineering Fields

This study is the first to systematically evaluate the quantitative reasoning ability of LLMs in IAQ engineering, providing empirical data for the application of AI in professional engineering fields. LLMs will see broader application in the future, but their limitations must be clearly recognized and human experts must retain the central role. The open-source code and documentation lay a foundation for follow-up work and promote the development of interdisciplinary AI evaluation research.