Zing Forum


OMIBench: A New Benchmark for Multi-Image Olympic-Level Reasoning

OMIBench is the first benchmark specifically targeting multi-image Olympic-level reasoning, covering four major domains (biology, chemistry, mathematics, and physics) with more than 1,000 questions. Even the strongest models, such as Gemini-3-Pro, achieve an accuracy of only around 50%, revealing significant limitations of current large vision-language models (LVLMs) in cross-image reasoning.

Tags: OMIBench · multi-image reasoning · large vision-language models · Olympic-level benchmark · multimodal reasoning · LVLM · Chain-of-Thought · cross-image reasoning
Published 2026-04-24 01:28 · Recent activity 2026-04-24 01:49 · Estimated read: 5 min

Section 01

OMIBench: A Guide to the New Benchmark for Multi-Image Olympic-Level Reasoning

OMIBench is the first benchmark specifically designed for multi-image Olympic-level reasoning, covering four major domains (biology, chemistry, mathematics, and physics) with more than 1,000 questions. Even the strongest model evaluated, Gemini-3-Pro, achieves an accuracy of only about 50%, revealing significant limitations of current large vision-language models (LVLMs) in cross-image reasoning. The benchmark was jointly developed by multiple universities and fills a gap left by existing multimodal Olympic benchmarks, which are limited to single-image settings.


Section 02

Evolution and Challenges of Multimodal Reasoning

In recent years, LVLMs have made significant progress on Olympic-level reasoning tasks, with Chain-of-Thought (CoT) prompting helping models integrate visual cues with textual information. However, most existing multimodal Olympic benchmarks are limited to single-image problems, whereas real-world scenarios often rely on multiple related diagrams and require cross-image, cross-modal reasoning. This is the core challenge today.
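To make the single-image vs. multi-image distinction concrete, here is a minimal sketch of how a multi-image CoT query might be assembled. The message schema, field names, and image placeholders are generic illustrations, not OMIBench's actual format or any specific model API:

```python
# Sketch of assembling a multi-image Chain-of-Thought query.
# The message/content schema here is an illustrative assumption,
# not OMIBench's real format or a specific provider's API.

def build_multi_image_cot_prompt(question: str, image_refs: list[str]) -> list[dict]:
    """Interleave labeled image references with the question and a CoT cue."""
    content = []
    for i, ref in enumerate(image_refs, start=1):
        content.append({"type": "image", "ref": ref, "label": f"Figure {i}"})
    content.append({"type": "text", "text": question})
    # CoT cue: ask the model to reason across all figures before answering.
    content.append({
        "type": "text",
        "text": "Think step by step, citing each figure you use, "
                "then state the final answer.",
    })
    return [{"role": "user", "content": content}]

msgs = build_multi_image_cot_prompt(
    "Using the circuit in Figure 1 and the plot in Figure 2, "
    "find the resonant frequency.",
    ["circuit.png", "bode_plot.png"],
)
```

The key point the sketch captures is that a multi-image question forces the model to associate content across several labeled figures, not just parse one.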


Section 03

Design and Core Features of OMIBench

OMIBench was jointly developed by institutions including Harbin Institute of Technology and Central South University, and is the first multi-image Olympic-level reasoning benchmark. It contains more than 1,000 questions, with an average of 3.07 images per question, each accompanied by manually annotated reasoning paths and answers. Core features:

  1. Requirement for multi-image information integration;
  2. Manual annotation of reasoning paths;
  3. Dual evaluation of precision and semantics;
  4. Coverage of four major scientific domains.
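The paper's exact scoring rules are not reproduced here, but the "dual evaluation of precision and semantics" (feature 3) can be sketched as a strict exact-match check backed by a looser semantic fallback. The token-overlap judge below is a stand-in assumption for whatever semantic judge (LLM- or embedding-based) the benchmark actually uses:

```python
# Sketch of a dual precision/semantic answer check. OMIBench's actual
# scoring rules are not specified here; this only illustrates the idea.
import re

def exact_match(pred: str, gold: str) -> bool:
    """Precision check: strict equality after normalizing whitespace/case."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return norm(pred) == norm(gold)

def semantic_match(pred: str, gold: str) -> bool:
    """Semantic check: token-overlap stand-in for an LLM/embedding judge."""
    pred_toks = set(pred.lower().split())
    gold_toks = set(gold.lower().split())
    return len(pred_toks & gold_toks) / max(len(gold_toks), 1) >= 0.8

def dual_score(pred: str, gold: str) -> bool:
    """An answer counts as correct if either check passes."""
    return exact_match(pred, gold) or semantic_match(pred, gold)
```

The design intent is that free-form Olympiad answers ("approximately 42 joules") should not be penalized against a terse gold answer ("42 joules") when the meaning matches, while still rewarding exact answers.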

Section 04

Experimental Results and Model Capability Boundaries

Evaluation of state-of-the-art LVLMs shows that Gemini-3-Pro reaches an accuracy of only about 50%, and no model exceeds 51%. Performance drops by 15% relative to single-image benchmarks and by more than 20% relative to existing multi-image benchmarks. Error analysis identifies three failure modes: visual perception failure, cross-image association failure, and cross-modal logic integration failure.


Section 05

Exploration of Improvement Strategies and Their Limitations

Various enhancement strategies were evaluated: long CoT yields limited gains; test-time scaling (parallel or sequential) brings consistent but limited improvements; in-context learning (ICL) improves performance but with diminishing returns; Think-with-Image offers almost no gain and can even degrade performance; and parameter scaling has little effect. This suggests that architectural innovation, rather than mere scale expansion, is needed.
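For reference, the parallel test-time strategy mentioned above is commonly implemented as majority voting over k independently sampled answers (self-consistency). The sketch below uses a toy stochastic stand-in for a real LVLM, since the actual models and sampling setup are not shown here:

```python
# Sketch of parallel test-time scaling: sample k answers, majority-vote.
# `model` is a stand-in callable; real LVLM sampling is assumed, not shown.
from collections import Counter
import random

def majority_vote(model, prompt: str, k: int = 5, seed: int = 0) -> str:
    """Draw k samples from the model and return the most common answer."""
    rng = random.Random(seed)
    samples = [model(prompt, rng) for _ in range(k)]
    return Counter(samples).most_common(1)[0][0]

def toy_model(prompt: str, rng: random.Random) -> str:
    # Hypothetical stochastic model: answers "B" with probability 0.6.
    return "B" if rng.random() < 0.6 else "A"

answer = majority_vote(toy_model, "Which option matches Figures 1-3?", k=25)
```

Voting amplifies a model that is right more often than not, which is consistent with the finding that gains are real but limited: if the base model fails on cross-image association, sampling more answers cannot recover the missing capability.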


Section 06

Implications for the Research Community and Resource Access

Significance of OMIBench:

  1. Provides a standardized multi-image reasoning evaluation tool;
  2. Highlights the insufficiency of current technical approaches, motivating new architectures and training paradigms;
  3. Manually annotated reasoning paths facilitate interpretability research.

Resources: paper (arXiv:2604.20806), dataset (HuggingFace), code repository (GitHub), and unofficial implementation scaffolding.

Section 07

Conclusion: Challenges and Opportunities in Multi-Image Reasoning

OMIBench marks a new stage in multimodal reasoning evaluation, revealing the limitations of LVLMs in complex multi-image reasoning. For developers, it is both a challenge and a target for improvement, pointing the way toward the design of next-generation multimodal architectures. We look forward to the community making breakthroughs in multi-image reasoning.