Section 01
[Introduction] A Study on the Systematic Auditing Framework for Physical Reasoning Capabilities of Vision-Language Models
Researchers at Newcastle University developed a visual auditing system based on the violation-of-expectation framework. Using the classic Shell Game task, the system tests whether state-of-the-art Vision-Language Models (VLMs) can handle object permanence, temporal continuity, and hidden-state reasoning. The results reveal fundamental limitations in physical understanding, as well as calibration gaps, in current models.
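To make the hidden-state reasoning in the Shell Game concrete, the following is a minimal sketch of how the ground-truth ball position could be tracked through a sequence of cup swaps, and how a revealed position that contradicts it would count as a violation of expectation. The function names (`track_ball`, `is_violation`) and the representation of swaps as index pairs are assumptions made for illustration; they are not taken from the Newcastle system.

```python
from typing import List, Tuple


def track_ball(start: int, swaps: List[Tuple[int, int]]) -> int:
    """Return the index of the cup hiding the ball after all swaps.

    Hypothetical helper: cups are numbered by position, and each swap
    (a, b) exchanges the cups at positions a and b.
    """
    pos = start
    for a, b in swaps:
        # The ball moves only when its own cup takes part in the swap.
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
    return pos


def is_violation(revealed: int, start: int, swaps: List[Tuple[int, int]]) -> bool:
    """True if the revealed ball position contradicts the tracked position."""
    return revealed != track_ball(start, swaps)


# Example: ball starts under cup 0, followed by three swaps.
swaps = [(0, 1), (1, 2), (0, 2)]
final_position = track_ball(0, swaps)
print(final_position)                 # position consistent with object permanence
print(is_violation(1, 0, swaps))      # a reveal under cup 1 would be physically impossible
```

A model with sound object permanence should agree with the tracked position, and should treat any reveal that disagrees with it as physically implausible.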