Zing Forum

SAMA Dataset: A VQA Benchmark for Evaluating Spatial Reasoning Capabilities of Vision-Language Models

Released by the University of California, Riverside, SAMA is the first large-scale VQA dataset designed specifically to evaluate the local spatial reasoning of vision-language models on non-standard attraction maps. It contains 4296 question-answer pairs.

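SAMA Dataset | VQA | Vision-Language Models | Spatial Reasoning | Attraction Maps | University of California, Riverside | Benchmark | Multimodal AI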
Published 2026-06-17 09:16 | Recent activity 2026-06-17 09:23 | Estimated read 5 min

Section 01

Introduction: SAMA Dataset - A New Benchmark for Evaluating Spatial Reasoning Capabilities of VLMs

The SAMA Dataset, released by the Al-Shareedah team at the University of California, Riverside, is the first large-scale VQA benchmark designed specifically to evaluate the local spatial reasoning of vision-language models on non-standard attraction maps. It contains 4296 manually verified question-answer pairs and is open-sourced on GitHub (https://github.com/Al-Shareedah/SAMA-Dataset). The repository was created on June 9, 2026, and updated on June 17, 2026.

Section 02

Project Background and Motivation

As vision-language models (VLMs) improve at tasks such as image understanding, evaluating their spatial reasoning has become increasingly important. Traditional VQA benchmarks mostly rely on standard maps or natural images, whereas real-world navigation often depends on non-standard attraction maps, such as those for theme parks or shopping malls. These maps are not drawn to scale and carry no standard coordinates, which poses distinct challenges for AI models. The SAMA Dataset was created to fill this evaluation gap.

Section 03

Data Generation Method and License

SAMA is built with a human-machine collaborative process: initial question-answer pairs are generated with Gemini 3 Pro/Gemma 3, and every pair is then manually verified and revised to ensure quality. The dataset is released under the MIT License, which permits free use, modification, and distribution.

Section 04

Dataset Overview (Evidence)

SAMA contains 49 real attraction maps covering 6 categories, including theme parks and zoos, with 4296 question-answer pairs in total. Question types include facility search and relative positioning, among others. The question-answer pairs are stored as JSON, organized by map category, with complete metadata. For example, questions in the shopping mall category ask about the number of facilities or their relative directions.
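
As a rough illustration of how category-organized JSON records might be consumed, here is a minimal Python sketch. The field names (`map_id`, `category`, `question`, `answer`) and the three sample records are assumptions made up for this example; the actual schema in the repository may differ. The last lines only restate the arithmetic implied by the figures quoted above.

```python
import json
from collections import Counter

# Hypothetical records: field names and values are illustrative only,
# not SAMA's confirmed schema or real entries from the dataset.
SAMPLE_JSON = """
[
  {"map_id": "mall_01", "category": "shopping_mall",
   "question": "How many restrooms are on the map?", "answer": "3"},
  {"map_id": "mall_01", "category": "shopping_mall",
   "question": "Which store is directly left of the food court?", "answer": "Bookstore"},
  {"map_id": "zoo_02", "category": "zoo",
   "question": "Which exhibit is nearest to the main entrance?", "answer": "Aviary"}
]
"""


def count_by_category(records):
    """Return a Counter mapping each map category to its number of QA pairs."""
    return Counter(r["category"] for r in records)


records = json.loads(SAMPLE_JSON)
counts = count_by_category(records)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")

# Average density implied by the quoted totals (4296 pairs across 49 maps).
pairs_per_map = 4296 / 49
print(f"{pairs_per_map:.1f} pairs per map on average")
```

With the real files, the same grouping would show how evenly the 4296 pairs are spread across the 6 categories, assuming each record carries some category label.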

Section 05

Core Challenges and Features

Non-standard attraction maps are not drawn to scale, lack geographic coordinates, rely on symbolic representation, and use varied perspectives, so traditional geographic reasoning methods do not carry over. SAMA focuses on local spatial reasoning: models must recognize symbols, understand relative directions, and plan paths, among other skills.

Section 06

Research Significance and Application Value (Conclusion)

SAMA provides a standardized platform for evaluating VLM spatial reasoning, helping to identify model bottlenecks and compare the strengths and weaknesses of different architectures. Findings from it can inform applications such as intelligent tour guides, indoor navigation, assistive technologies, and robot navigation.
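
One way such a benchmark could surface model bottlenecks is a per-question-type accuracy breakdown. The sketch below assumes exact-match scoring after light normalization, which is a choice made for this example and not a metric the source specifies. The keys `id`, `type`, and `answer`, the type names, and the sample data are likewise hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are not scored as errors."""
    return " ".join(text.lower().split())


def accuracy_by_type(examples, predictions):
    """Exact-match accuracy per question type.

    examples: list of dicts with hypothetical keys "id", "type", "answer".
    predictions: dict mapping example id to the model's answer string.
    A missing prediction is scored as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["type"]] += 1
        pred = predictions.get(ex["id"], "")
        if normalize(pred) == normalize(ex["answer"]):
            correct[ex["type"]] += 1
    return {t: correct[t] / total[t] for t in total}


# Made-up examples and model outputs, for illustration only.
examples = [
    {"id": "q1", "type": "facility_search", "answer": "Gift Shop"},
    {"id": "q2", "type": "facility_search", "answer": "3"},
    {"id": "q3", "type": "relative_positioning", "answer": "north"},
]
predictions = {"q1": "gift  shop", "q2": "4", "q3": "North"}

scores = accuracy_by_type(examples, predictions)
print(scores)
```

Reporting scores per type, not as a single number, makes it easier to see whether a model fails on counting, direction, or search questions. Free-form answers would likely need a more forgiving comparison than exact match.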

Section 07

Current Limitations

SAMA has the following limitations: it supports English only; its map types, although diverse, could be extended (e.g., to hospitals and campuses); and at 4296 question-answer pairs it is medium-sized, so a larger scale would be needed to improve generalization.

Section 08

Suggestions for Future Directions

Future work could expand multilingual support, add dynamic maps, introduce multi-turn dialogue VQA tasks, and develop dedicated model architectures to build on the dataset.