Section 01
Introduction to the Tractatus-Eval Benchmark: Revealing the Cognitive Limitations of Large Language Models in Spatial Embodied Reasoning
Tractatus-Eval is a benchmark, inspired by Wittgenstein's philosophy, that evaluates the spatial embodied logical reasoning of large language models (LLMs). It aims to quantify where LLM performance on such tasks breaks down and to reveal the cognitive limitations of text-only models. Built on six physical reasoning tasks and a zero-contamination verification mechanism, the benchmark gives the AI research community a reliable measurement tool and can help guide the design of next-generation systems.