Zing Forum

Can Multimodal Large Models Understand Petroleum Engineering Drawings? A Practical Test of 6 Cutting-Edge Models Including GPT-5.5 and Claude

A benchmark test on the performance of vision-language models in the petroleum engineering field shows that GPT-5.5 and Claude-Opus-4.7 have reached a level close to domain experts in interpreting professional charts, but still have significant gaps in specialized tasks such as seismic facies analysis.

Tags: Multimodal Large Models · Vision-Language Models · Petroleum Engineering · Benchmark Testing · GPT-5.5 · Claude · Gemini · Grok · Qwen · Domain Applications
Published 2026-05-15 05:43 · Recent activity 2026-05-15 05:47 · Estimated read 8 min

Section 01

Core Conclusions of the Benchmark on Petroleum Engineering Drawing Interpretation by Cutting-Edge Multimodal Models

A benchmark test (ellm-multimodal-benchmark) evaluating the performance of vision-language models in the petroleum engineering field shows that GPT-5.5 and Claude-Opus-4.7 have reached a level close to domain experts in general chart interpretation and reasoning tasks, but still have significant gaps in specialized sub-tasks such as seismic facies analysis. This test covers 6 cutting-edge models and provides important references for AI applications in petroleum engineering.

Section 02

Test Background: The Intersection of AI and Petroleum Engineering

Petroleum engineering involves complex technologies such as seismic exploration and well logging analysis, where engineers need to interpret a large number of professional charts (e.g., seismic profiles, well logging curves). The long-standing assumption is that general-purpose vision-language models (VLMs) can only describe the surface content of charts and cannot perform technical interpretation or domain reasoning. This test aims to verify whether this assumption holds.

Section 03

Test Methodology and Dataset

ellm-multimodal-benchmark is an end-to-end evaluation framework developed by jalalirs. Its pipeline: collect real charts from arXiv geophysics papers and Wikimedia Commons, filter and classify them via VLMs, generate expert QA pairs, blind-test the 6 models through OpenRouter, and have Claude-Sonnet-4.6 score the answers independently. The dataset contains 123 items spanning 12 chart types, with questions at three difficulty levels: descriptive, explanatory, and inferential.
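
The aggregation step implied by the metrics reported below can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: it assumes each item receives an integer judge score from 0 to 3, and the helper name `aggregate` is hypothetical.

```python
# Sketch of turning per-item judge scores on the 3-point scale into the
# two headline metrics reported in the results table (hypothetical helper).

def aggregate(scores: list[int]) -> dict[str, float]:
    """Derive score rate and expert pass rate from judge scores (0-3)."""
    n = len(scores)
    return {
        # Score rate: points earned over points possible.
        "score_rate": sum(scores) / (3 * n),
        # Expert pass rate: fraction of items scoring at least 2 of 3.
        "pass_rate": sum(s >= 2 for s in scores) / n,
    }

# Example: four items judged 3, 3, 2, 1.
metrics = aggregate([3, 3, 2, 1])
print(metrics)  # score_rate = 9/12 = 0.75, pass_rate = 3/4 = 0.75
```

Under this reading, a 90% score rate corresponds to a mean score of 2.7 on the 3-point scale.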

Section 04

Overall Performance Comparison of Models

The test panel includes 6 models: GPT-5.5, Claude-Opus-4.7, Gemini-3.1-Pro-preview, Gemini-2.5-Pro, Grok-4.3, and Qwen3-VL-235B. A 3-point scoring system was used, and the results are as follows:

Model              Score Rate   Expert Pass Rate (≥2/3)   Hallucination Rate
GPT-5.5            90.0%        92.7%                     12.2%
Claude-Opus-4.7    84.6%        88.6%                     25.2%
Gemini-3.1-Pro     81.1%*       88.9%                     27.8%
Grok-4.3           75.3%        82.1%                     38.2%
Gemini-2.5-Pro     75.3%        84.6%                     40.7%
Qwen3-VL-235B      67.8%        75.6%                     52.0%

*Gemini-3.1-Pro only completed 90/123 items due to API limitations

Hallucination rate is strongly negatively correlated with overall score: the stronger the model, the fewer its hallucinations.

Section 05

Key Insights: Strengths and Weaknesses of Model Capabilities

  1. The "surface-only description" assumption is invalid: GPT-5.5 and Claude-Opus-4.7 are close to domain experts in general chart interpretation and reasoning tasks (score rate 85-90%), with minimal gaps between scores for descriptive and multi-step reasoning questions.
  2. Gaps remain in specialized sub-tasks: In specialized tasks such as seismic facies analysis (e.g., F3 seismic facies identification of stratigraphic units, counting facies types), the best model GPT-5.5 only scored 2.17/3, while other models scored around 1.8-1.9/3; performance in tasks like composite well logging curve interpretation is also poor.
  3. Open-source models need to catch up: Qwen3-VL-235B, the strongest open-source model tested, scores about 0.7 points (on the 3-point scale) below the top closed-source models and has a roughly 4x higher hallucination rate. Domain adaptation offers real headroom, but the baseline gap is substantial.
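
The figures in the third insight follow from converting score rates back to the 3-point scale, using only numbers from the table above:

```python
# Convert score rates from the results table back to mean scores on the
# 3-point scale to see the open-source gap in absolute points.
def to_points(score_rate_pct: float) -> float:
    return score_rate_pct / 100 * 3

gpt = to_points(90.0)    # GPT-5.5:       2.70 points
qwen = to_points(67.8)   # Qwen3-VL-235B: 2.034 points
print(round(gpt - qwen, 2))  # ≈ 0.67, i.e. "about 0.7 points"

ratio = 52.0 / 12.2      # ≈ 4.26, the "roughly 4x" hallucination rate
```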

Section 06

Practical Significance and Application Recommendations

References for AI application developers in petroleum engineering:

  • General chart interpretation: GPT-5.5 and Claude-Opus-4.7 can be used in scenarios such as auxiliary document analysis and training material generation;
  • Specialized analysis tasks: Seismic facies identification, complex well logging interpretation, etc., require manual review or domain fine-tuning;
  • Hallucination control: In critical decision-making scenarios, prioritize models with low hallucination rates (e.g., GPT-5.5) or design human-machine collaboration processes;
  • Open-source path: Using Qwen3-VL as a base model for domain adaptation is feasible, but more resource investment is needed.

Section 07

Test Limitations and Summary

Limitations: Document-level/long-context comprehensive tasks were not included; reference answers were generated based on paper titles and general knowledge, not re-derived by independent experts; Gemini-3.1-Pro did not complete all tests; chart sources have varying license terms.

Summary: Cutting-edge multimodal large models far exceed the "surface description" level in petroleum engineering chart interpretation capabilities, but specialized sub-tasks still need improvement. This test provides valuable benchmark references and model selection basis for AI-assisted petroleum engineering applications.