SAMA Dataset: A New Benchmark for Evaluating Spatial Reasoning Capabilities of Vision-Language Models on Non-Standard Guide Maps

The SAMA dataset, released by the University of California, Riverside, contains 49 real-scene guide maps and 4296 question-answer pairs, specifically designed to evaluate the spatial reasoning capabilities of Vision-Language Models (VLMs) on non-standard maps such as theme parks, zoos, and resorts.

Tags: VQA, visual question answering, vision-language models, spatial reasoning, guide maps, multimodal AI, benchmark datasets

Published 2026-06-17 09:16 | Recent activity 2026-06-17 09:20 | Estimated read 6 min

Section 01

SAMA Dataset: A New Benchmark for Evaluating Spatial Reasoning Capabilities of Vision-Language Models on Non-Standard Guide Maps

The SAMA dataset, released by the University of California, Riverside, is the first large-scale visual question-answering benchmark targeting non-standard attraction guide maps. It includes 49 real-scene guide maps (covering 6 categories such as theme parks, zoos, and resorts) and 4296 manually verified question-answer pairs. Its goal is to fill a gap in existing evaluations, which do not measure the spatial reasoning of Vision-Language Models (VLMs) on non-standard maps.

Section 02

Background and Motivation: Limitations of Existing VLM Evaluations and Real-Scene Needs

With the development of multimodal large models, VLMs have made progress in image understanding and image-text question answering, but existing benchmarks mostly focus on standard scenarios such as natural images and standard maps. In practice, many guide maps (amusement park schematics, for example) are non-standard: they are not drawn to scale and rely on stylized symbols, and traditional VQA datasets do not cover them. The SAMA dataset asks whether VLMs can understand spatial relationships in such maps, for instance by answering a question like 'How do I get from the carousel to the roller coaster?'

Section 03

Dataset Overview: Scale, Categories, and Question Types

Key statistics of the SAMA (Spatial Answering over Maps of Attractions) dataset:

Scale: 49 real guide maps and 4296 manually verified question-answer pairs;
Categories: 6 scenario types (theme parks, zoos, resorts, shopping malls, museums, trails);
Question types: facility search, legend symbol interpretation, relative position judgment, direction navigation, and others;
Construction: question-answer pairs were generated with assistance from Gemini 3 Pro and Gemma3, and 100% of them were manually verified.

Section 04

Data Structure and Examples: JSON Format and Typical Question-Answer Pairs

The SAMA dataset is organized into JSON files by map category. Each question-answer record contains fields such as question_id, image_id, question, and reference_answers. Two examples: the mall question 'How many Clothing stores are there in the mall?' has the answer '10.0', and the spatial orientation question 'In which map direction is Swarovski located relative to Sushi Siam?' has the answer 'Southwest'.
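To make the record layout concrete, here is a minimal Python sketch that parses records with the four fields named above. Several details are assumptions rather than facts from the dataset: that each file holds a top-level JSON list, that reference_answers is a list of strings, and the question_id and image_id values, which are invented for illustration. Only the questions and answers come from the examples above.

```python
import json
from dataclasses import dataclass


@dataclass
class QARecord:
    question_id: str
    image_id: str
    question: str
    reference_answers: list


# Hypothetical file content: the ids are made up; the questions and
# answers are the two examples quoted in this article.
SAMPLE_JSON = """
[
  {"question_id": "mall_0001", "image_id": "mall_01",
   "question": "How many Clothing stores are there in the mall?",
   "reference_answers": ["10.0"]},
  {"question_id": "mall_0002", "image_id": "mall_01",
   "question": "In which map direction is Swarovski located relative to Sushi Siam?",
   "reference_answers": ["Southwest"]}
]
"""


def parse_records(text):
    """Parse a JSON list of question-answer dicts into QARecord objects."""
    records = []
    for raw in json.loads(text):
        answers = raw["reference_answers"]
        # Assumption: tolerate a single answer stored as a bare value.
        if not isinstance(answers, list):
            answers = [answers]
        records.append(
            QARecord(
                question_id=str(raw["question_id"]),
                image_id=str(raw["image_id"]),
                question=raw["question"],
                reference_answers=[str(a) for a in answers],
            )
        )
    return records


records = parse_records(SAMPLE_JSON)
for record in records:
    print(record.question_id, "->", record.reference_answers)
```

Check the README.md shipped with the dataset for the actual schema before relying on this layout.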

Section 05

Evaluation Dimensions: Four Major Challenges for VLMs

The SAMA dataset evaluates VLMs along four dimensions:

1. Symbol and legend understanding: mapping facility names to stylized symbols;
2. Relative position reasoning: relationships such as 'left' or 'nearby' on maps that are not drawn to scale;
3. Direction and navigation understanding: path planning and direction judgment;
4. Cross-category generalization: transferring reasoning capabilities across guide maps from different scenarios.
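One way to report results along these dimensions is accuracy per dimension. The sketch below is an illustration under stated assumptions, not SAMA's official scoring: it assumes exact-match scoring after light normalization (so that '10' matches the reference '10.0'), and it assumes each result carries a dimension label. The label names, and which label each demo row gets, are hypothetical.

```python
from collections import defaultdict


def normalize(answer):
    """Lowercase and trim; render whole numbers without a trailing '.0'."""
    text = str(answer).strip().lower()
    try:
        number = float(text)
    except ValueError:
        return text
    return str(int(number)) if number.is_integer() else str(number)


def is_correct(prediction, reference_answers):
    """Exact match against any reference answer after normalization."""
    return normalize(prediction) in {normalize(r) for r in reference_answers}


def accuracy_by_dimension(results):
    """results: iterable of (dimension, prediction, reference_answers)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, prediction, reference_answers in results:
        totals[dimension] += 1
        hits[dimension] += int(is_correct(prediction, reference_answers))
    return {dimension: hits[dimension] / totals[dimension] for dimension in totals}


# Hypothetical model outputs; the dimension labels are illustrative only.
demo = [
    ("symbol_legend", "10", ["10.0"]),
    ("direction_navigation", "southwest", ["Southwest"]),
    ("direction_navigation", "North", ["Southwest"]),
]
print(accuracy_by_dimension(demo))
```

Exact match is strict for free-form answers such as route descriptions; a real evaluation would likely need a more tolerant comparison for those question types.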

Section 06

Research Significance and Applications: Promoting VLM Development and Intelligent Navigation

Significance of SAMA:

Provides a standardized benchmark for VLM spatial reasoning, identifying the boundaries of model capabilities;
Assists in the development of intelligent navigation assistants (e.g., tourists taking photos of guide maps to ask for directions);
Serves as a teaching case for multimodal AI, helping learners understand the capabilities and challenges of VLMs.

Section 07

Access and Usage: Open-Source License and Resource Content

The SAMA dataset is released as open source under the MIT license. The release includes the data/ directory (JSON-formatted question-answer data), the maps/ directory (guide map images), and README.md (usage instructions). To ensure quality, the dataset was built by combining LLM generation with manual verification.
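Given that layout, a first look at the data could be a short Python script like the one below. It is a sketch under assumptions: that data/ contains one .json file per map category, that each file holds a top-level JSON list of records, and that the script runs from the repository root. The function names are illustrative and not part of the dataset.

```python
import json
from pathlib import Path


def load_categories(data_dir):
    """Return {file_stem: parsed_json} for every *.json file in data_dir."""
    by_category = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            by_category[path.stem] = json.load(handle)
    return by_category


def count_records(by_category):
    """Count the records in each category (assumes each value is a list)."""
    return {name: len(records) for name, records in by_category.items()}


if __name__ == "__main__":
    # Assumption: the current directory is the root of the dataset checkout.
    if Path("data").is_dir():
        for name, count in count_records(load_categories("data")).items():
            print(f"{name}: {count} question-answer pairs")
```

If the counts do not add up to the 4296 pairs reported above, the file layout probably differs from what this sketch assumes, and README.md is the place to check.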

Section 08

Summary and Outlook: Filling the Gap, Looking Forward to Model Breakthroughs

SAMA fills the gap in evaluating the spatial reasoning of VLMs on non-standard guide maps, providing a tool to assess and enhance the spatial understanding capabilities of VLMs in real scenarios. We look forward to more models achieving breakthroughs on SAMA in the future, enabling more intelligent visual question-answering systems to help people navigate complex spatial environments.
