# OMIBench: A New Benchmark for Multi-Image Olympic-Level Reasoning

> OMIBench is the first benchmark specifically targeting multi-image Olympic-level reasoning, covering four major domains—biology, chemistry, mathematics, and physics—with over 1000 questions. Even the strongest models, such as Gemini-3-Pro, reach only about 50% accuracy, revealing significant limitations of current large vision-language models (LVLMs) in cross-image reasoning.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-23T17:28:50.000Z
- Last activity: 2026-04-23T17:49:22.033Z
- Heat: 152.7
- Keywords: OMIBench, multi-image reasoning, large vision-language models, Olympic-level, benchmark, multimodal reasoning, LVLM, Chain-of-Thought, cross-image reasoning
- Page URL: https://www.zingnex.cn/en/forum/thread/omibench
- Canonical: https://www.zingnex.cn/forum/thread/omibench
- Markdown source: floors_fallback

---

## OMIBench: A Guide to the New Benchmark for Multi-Image Olympic-Level Reasoning

OMIBench is the first benchmark specifically designed for multi-image Olympic-level reasoning, covering four domains (biology, chemistry, mathematics, and physics) with over 1000 questions. Even the strongest model, Gemini-3-Pro, reaches only about 50% accuracy, revealing significant limitations of current large vision-language models (LVLMs) in cross-image reasoning. Jointly developed by multiple universities, the benchmark fills a gap left by existing multimodal Olympiad benchmarks, which are limited to single-image settings.

## Evolution and Challenges of Multimodal Reasoning

In recent years, LVLMs have made significant progress on Olympic-level reasoning tasks, with Chain-of-Thought (CoT) prompting helping models integrate visual cues with textual information. However, most existing multimodal Olympiad benchmarks are limited to single-image problems, while real-world scenarios often depend on multiple related diagrams that require cross-image and cross-modal reasoning. Closing this gap is the core challenge the benchmark targets.
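To make the multi-image CoT setup concrete, here is a minimal sketch of how such a prompt might be assembled. The image-placeholder format and prompt wording are illustrative assumptions, not OMIBench's actual evaluation protocol:

```python
# Illustrative sketch: assembling a Chain-of-Thought prompt for a
# multi-image question. Placeholder format and wording are assumptions,
# not OMIBench's actual protocol.

def build_cot_prompt(question: str, num_images: int) -> str:
    """Prepend one placeholder per image, then the question, then a
    step-by-step reasoning instruction."""
    image_tags = "\n".join(f"<image_{i + 1}>" for i in range(num_images))
    return (
        f"{image_tags}\n"
        f"Question: {question}\n"
        "Consider the information in every image jointly, "
        "then reason step by step before giving a final answer."
    )

prompt = build_cot_prompt(
    "Which titration curve matches the indicator shown?", 3
)
print(prompt)
```

A real harness would replace the placeholders with the model API's native image inputs; the point is that all images must enter a single reasoning context.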

## Design and Core Features of OMIBench

OMIBench was jointly developed by institutions including Harbin Institute of Technology and Central South University, and is the first multi-image Olympic reasoning benchmark. It contains over 1000 questions, with an average of 3.07 images per question, accompanied by manually annotated reasoning paths and answers. Core features:
1. Requirement for multi-image information integration;
2. Manual annotation of reasoning paths;
3. Dual evaluation of precision and semantics;
4. Coverage of four major scientific domains.
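The "dual evaluation of precision and semantics" above can be sketched as a pair of checks: a strict normalized exact match plus a looser semantic match. This is a hypothetical stand-in; OMIBench's actual semantic scoring is unspecified here and would more plausibly use an LLM judge or embedding similarity rather than the word-overlap proxy below:

```python
# Hypothetical sketch of dual evaluation: a strict (precision) check
# plus a loose semantic check. Word-set overlap is only a stand-in for
# whatever semantic scorer the benchmark actually uses.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def semantic_match(pred: str, gold: str, threshold: float = 0.6) -> bool:
    """Stand-in semantic check: Jaccard overlap of word sets."""
    p, g = set(normalize(pred).split()), set(normalize(gold).split())
    if not p or not g:
        return False
    return len(p & g) / len(p | g) >= threshold

def dual_score(pred: str, gold: str) -> dict:
    return {"exact": exact_match(pred, gold),
            "semantic": semantic_match(pred, gold)}

print(dual_score("cell divides by mitosis", "the cell divides by mitosis"))
```

Reporting both scores separates answers that are formatted wrong but substantively right from answers that are simply wrong.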

## Experimental Results and Model Capability Boundaries

Evaluation of state-of-the-art LVLMs shows that Gemini-3-Pro reaches only about 50% accuracy, and no model exceeds 51%. Performance drops by 15% relative to single-image benchmarks and by more than 20% relative to existing multi-image benchmarks. Error analysis identifies three failure modes: visual perception failure, cross-image association failure, and cross-modal logic integration failure.

## Exploration of Improvement Strategies and Their Limitations

Various enhancement strategies were evaluated: long CoT yields limited gains; test-time scaling (parallel and sequential) brings consistent but modest improvements; in-context learning (ICL) helps but with diminishing returns; Think-with-Image adds almost no gain and can even degrade performance; and scaling model parameters has little effect. This suggests that architectural innovation, rather than mere scale, is needed.
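Parallel test-time scaling of the kind evaluated above is commonly implemented as self-consistency: sample several reasoning traces and majority-vote their final answers. A minimal sketch, where the sampled answers stand in for actual LVLM calls:

```python
# Sketch of parallel test-time scaling via self-consistency: run the
# model several times and majority-vote the final answers. The sample
# list below stands in for real sampled LVLM outputs.

from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties break by first occurrence."""
    counts = Counter(answers)
    return max(answers, key=lambda a: counts[a])

samples = ["B", "A", "B", "C", "B"]  # e.g. five sampled CoT runs
print(majority_vote(samples))        # -> B
```

The benchmark's finding that such scaling gives only limited gains suggests the sampled traces share systematic cross-image errors, so voting cannot correct them.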

## Implications for the Research Community and Resource Access

Significance of OMIBench:
1. It provides a standardized evaluation tool for multi-image reasoning;
2. It highlights the insufficiency of current technical paths, motivating new architectures and training paradigms;
3. Its manually annotated reasoning paths facilitate interpretability research.

Resources: paper (arXiv:2604.20806), dataset (HuggingFace), code repository (GitHub), and unofficial implementation scaffolding.

## Conclusion: Challenges and Opportunities in Multi-Image Reasoning

OMIBench marks a new stage in multimodal reasoning evaluation, revealing the limitations of LVLMs in complex multi-image reasoning. For developers, it is both a challenge and a target for improvement, pointing the way toward next-generation multimodal architectures. We look forward to the community's breakthroughs in multi-image reasoning.
