ESPIRE Benchmark: Evaluating Embodied Spatial Reasoning Capabilities of Vision-Language Models (Introduction)
Current vision-language models (VLMs) perform well on tasks such as image captioning and visual question answering, but they fall short in embodied spatial reasoning, i.e., understanding and reasoning about physical spatial relationships from an agent's point of view. ESPIRE (Embodied Spatial Reasoning Benchmark) is a diagnostic benchmark targeting this capability. Through design choices such as embodied perspectives, multi-level spatial relationships, and diverse reasoning types, it evaluates models' spatial understanding, reveals their limitations, and points to directions for improvement.
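To make the design dimensions concrete, the sketch below shows one way a benchmark item could be represented, annotated along the three axes named above (perspective, spatial-relation level, reasoning type), together with a simple accuracy scorer. All field names, category values, and example questions here are illustrative assumptions, not the actual ESPIRE schema.

```python
from dataclasses import dataclass

@dataclass
class SpatialItem:
    # Hypothetical item schema; field names and category values are
    # assumptions for illustration, not the real ESPIRE format.
    question: str
    choices: list[str]
    answer: str
    perspective: str      # e.g. "egocentric" vs. "allocentric"
    relation_level: str   # e.g. "object-object" vs. "agent-object"
    reasoning_type: str   # e.g. "direction", "distance", "perspective-taking"

def accuracy(items: list[SpatialItem], predictions: list[str]) -> float:
    # Fraction of items where the model's prediction matches the gold answer.
    correct = sum(1 for it, p in zip(items, predictions) if p == it.answer)
    return correct / len(items)

items = [
    SpatialItem(
        "Is the mug to the left of the laptop from the camera's viewpoint?",
        ["yes", "no"], "yes", "egocentric", "object-object", "direction",
    ),
    SpatialItem(
        "If you stood behind the chair, would the door be on your right?",
        ["yes", "no"], "no", "allocentric", "agent-object", "perspective-taking",
    ),
]
preds = ["yes", "yes"]
print(accuracy(items, preds))  # 0.5
```

Tagging each item along these axes would let an evaluation report accuracy per category, which is what makes a benchmark diagnostic rather than a single aggregate score.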