Zing Forum


TurnBack: Evaluating Geospatial Cognitive Ability of Large Language Models via Reverse Path Tasks

TurnBack is an innovative benchmark that evaluates the geospatial reasoning and navigation cognitive abilities of large language models by having them handle reverse path tasks, revealing the strengths and limitations of current models in spatial understanding.

Geospatial cognition · Large language models · Benchmark · Spatial reasoning · Navigation · EMNLP · Path planning · Embodied intelligence
Published 2026-04-06 03:11 · Recent activity 2026-04-06 03:18 · Estimated read 6 min

Section 01

[Introduction] TurnBack Benchmark: Evaluating Geospatial Cognitive Ability of Large Language Models via Reverse Path Tasks

TurnBack is an innovative benchmark that assesses the geospatial reasoning and navigation cognitive abilities of large language models through reverse path tasks, revealing the strengths and limitations of current models in spatial understanding. The benchmark has been accepted at EMNLP 2025; its core innovation is the "reverse path" paradigm, which tests whether a model has a deep understanding of spatial relationships. This article covers the background, methodology, experimental findings, error analysis, and future directions.


Section 02

Background: Spatial Intelligence and Spatial Cognitive Challenges of Large Language Models

Geospatial cognition is at the core of human intelligence, involving spatial relationship understanding, path planning, and memory, which are crucial for AI to achieve natural human-computer interaction and autonomous decision-making. Large language models have made significant progress in text understanding and generation, but their spatial cognitive ability remains an open question. The TurnBack benchmark is designed to systematically evaluate this ability.


Section 03

Methodology: Innovative Design Ideas of the TurnBack Benchmark

The core innovation of TurnBack lies in its "reverse path" testing paradigm: given a path description from point A to point B, the model is required to generate the reverse path from B back to A. This is not just a direction reversal; it requires the model to understand the relative positions of landmarks, identify reversible/irreversible road segments (e.g., one-way streets), and convert turn instructions (e.g., left turn to right turn), effectively distinguishing between models with true spatial understanding and those relying on surface pattern matching.
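The naive baseline this paradigm probes can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the step format and function names are assumptions. It shows why mere direction-flipping is shallow, since it reverses step order and swaps left/right but knows nothing about one-way streets or landmark topology.

```python
# Naive path reversal: reverse the step order and flip turn directions.
# This is exactly the surface-level transformation that TurnBack's reverse
# path tasks are designed to see past. Step format is assumed.

TURN_FLIP = {"left": "right", "right": "left"}

def naive_reverse(steps):
    """Reverse a list of (action, argument) navigation steps."""
    reversed_steps = []
    for action, arg in reversed(steps):
        if action == "turn":
            # A left turn on the way out becomes a right turn on the way back.
            reversed_steps.append(("turn", TURN_FLIP[arg]))
        else:
            # Movement steps (e.g. walking along a street) keep their argument.
            reversed_steps.append((action, arg))
    return reversed_steps

path = [("walk", "Main St"), ("turn", "left"), ("walk", "Oak Ave")]
print(naive_reverse(path))
# -> [('walk', 'Oak Ave'), ('turn', 'right'), ('walk', 'Main St')]
```

Note that this transformation silently produces an invalid route whenever a segment is one-way, which is precisely the kind of irreversible-segment reasoning the benchmark uses to separate pattern matching from genuine spatial understanding.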


Section 04

Methodology: Dataset Construction and Task Hierarchy Design

The TurnBack dataset follows linguistic principles and geoinformation science standards, collecting real-world navigation scenarios (urban streets, parks, indoor spaces, etc.). Each sample includes the original path description, reverse path description, and structured verification information. Tasks are divided into different difficulty levels (from simple straight paths to complex multi-turn routes, familiar/unfamiliar environments), allowing evaluation of model performance under varying complexities.
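A sample in this shape might look like the sketch below. The field names and values are hypothetical (the published schema is not reproduced here); the point is to show the three parts each sample is described as containing: the forward description, the reverse description, and structured verification information with a difficulty label.

```python
# Hypothetical TurnBack-style sample record. All field names are
# assumptions for illustration, not the dataset's actual schema.
sample = {
    "forward_description": "Walk along Main St, then turn left onto Oak Ave.",
    "reverse_description": "Walk along Oak Ave, then turn right onto Main St.",
    "verification": {
        "landmarks": ["Main St", "Oak Ave"],   # ordered landmarks on the route
        "turns": ["left"],                     # turns in the forward direction
        "reversible": True,                    # no one-way segments
        "difficulty": "simple",                # e.g. simple / multi-turn
        "environment": "urban_street",         # urban / park / indoor, etc.
    },
}
```

Structured verification fields like these are what make automatic scoring possible: a model's generated reverse description can be checked against the turn list and landmark order rather than only against the reference text.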


Section 05

Experimental Findings: Current State of Spatial Cognitive Ability in Large Language Models

TurnBack uses a multi-dimensional evaluation system, combining text similarity metrics (BLEU, ROUGE) with spatial task-specific metrics (path accuracy, turn accuracy, landmark recognition rate). The experiments show that current mainstream large language models perform far below human level; that model size correlates positively but non-linearly with spatial reasoning ability; and that models struggle most with specific spatial relationships such as relative direction and distance estimation.
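To make the spatial metrics concrete, here is a minimal sketch of a turn-accuracy style score: the fraction of reference turns that the predicted route matches in order. The exact definition TurnBack uses is not reproduced here; this is one plausible formulation, with the step format assumed.

```python
# Sketch of a turn-accuracy metric: compare predicted turns against
# reference turns position by position. Step format is assumed.

def extract_turns(steps):
    """Pull the ordered turn directions out of a (action, argument) list."""
    return [arg for action, arg in steps if action == "turn"]

def turn_accuracy(predicted, reference):
    """Fraction of reference turns matched, in order, by the prediction."""
    pred, ref = extract_turns(predicted), extract_turns(reference)
    if not ref:
        return 1.0  # no turns to get wrong
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(ref)

reference = [("walk", "Oak Ave"), ("turn", "right"), ("walk", "Main St")]
predicted = [("walk", "Oak Ave"), ("turn", "left"), ("walk", "Main St")]
print(turn_accuracy(predicted, reference))  # 0.0: the single turn is flipped
```

A text-overlap metric like BLEU would score this prediction highly, since only one word differs, which is why task-specific metrics such as turn accuracy are needed to expose direction errors.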


Section 06

Error Analysis: Systematic Limitations of Spatial Cognition in Large Language Models

In-depth error analysis reveals systematic limitations. Common errors include direction confusion (left-right reversal), distance misjudgment, topological errors (incorrect judgments about landmark connectivity), and failure to recognize irreversible road segments. This suggests that models have not formed a flexible internal spatial representation and rely on textual pattern matching rather than genuine spatial reasoning.


Section 07

Application Value and Future Research Directions

The TurnBack benchmark has academic and practical value: it provides a unified standard for evaluating model spatial cognition, guiding model optimization in application scenarios such as navigation systems and intelligent assistants; it reveals the potential limitations of large language models in the field of embodied intelligence. The project is fully open-source (dataset, evaluation code, framework). Future directions include expanding the dataset, developing dedicated architectures for spatial reasoning, exploring multimodal fusion, and injecting spatial knowledge into pre-trained models.