Section 01
[Introduction] GeoR-Bench: The Sobering State of Multimodal Models' Earth Science Reasoning
Institutions including the Chinese University of Hong Kong have released GeoR-Bench, a benchmark of 440 samples spanning 6 earth science domains and 24 task types. In testing, top closed-source models reach only 42.7% accuracy, while open-source models reach just 10.3%, exposing a severe bottleneck in the earth science reasoning of current multimodal AI.