Section 01
[Introduction] SceneTeract: Bridging the Gap Between Vision-Language Models and 3D Scene Understanding via Physical Verification
This article introduces SceneTeract, a framework for evaluating the functional affordances of 3D scenes by combining high-level semantic reasoning with low-level geometric verification. The study finds that current state-of-the-art vision-language models (VLMs) exhibit systematic biases when judging physical feasibility, and proposes using SceneTeract as a reward engine for post-training VLMs. At the core of SceneTeract is a grounded verification engine that supports agent-specific functional assessment. Evaluations across synthetic scenes and a range of VLMs reveal key shortcomings of existing systems, providing a technical foundation for the physical grounding of embodied AI.
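The paragraph above assigns the verification engine two roles: an agent-specific geometric feasibility check and a reward signal for post-training VLMs. The sketch below illustrates one plausible shape for both roles; it is a minimal illustration under assumed conventions, not the published SceneTeract interface, and every name in it (AgentSpec, verify_reachability, affordance_reward) as well as the voxel occupancy-grid scene representation is a hypothetical choice made for this example.

```python
import numpy as np
from dataclasses import dataclass

# Illustrative sketch only: names and data layout are assumptions,
# not the SceneTeract API described in the article.

@dataclass
class AgentSpec:
    """Coarse agent body model for agent-specific feasibility checks."""
    height: float  # standing height in meters
    radius: float  # horizontal body radius in meters
    reach: float   # maximum reach from body center in meters


def verify_reachability(occupancy: np.ndarray,
                        voxel_size: float,
                        agent: AgentSpec,
                        target_xyz: np.ndarray) -> bool:
    """Low-level geometric check: is there a free standing spot within the
    agent's reach of the target, with enough vertical clearance?

    occupancy: (X, Y, Z) boolean grid, True where space is occupied.
    target_xyz: target point in meters, in the grid's frame (origin at 0,0,0).
    """
    nx, ny, nz = occupancy.shape
    need_z = int(np.ceil(agent.height / voxel_size))          # vertical clearance in voxels
    need_r = max(1, int(np.ceil(agent.radius / voxel_size)))  # footprint radius in voxels
    reach_vox = agent.reach / voxel_size
    tx, ty = (target_xyz[:2] / voxel_size).astype(int)

    for ix in range(need_r, nx - need_r):
        for iy in range(need_r, ny - need_r):
            # Candidate standing cell must be within reach of the target.
            if np.hypot(ix - tx, iy - ty) > reach_vox:
                continue
            # The agent's footprint must be free from the floor up to its height.
            footprint = occupancy[ix - need_r:ix + need_r + 1,
                                  iy - need_r:iy + need_r + 1,
                                  :min(need_z, nz)]
            if not footprint.any():
                return True
    return False


def affordance_reward(vlm_says_feasible: bool, geometry_feasible: bool) -> float:
    """Reward-engine view: score the VLM's feasibility judgment against the
    grounded geometric verdict (1.0 if they agree, 0.0 otherwise)."""
    return 1.0 if vlm_says_feasible == geometry_feasible else 0.0
```

A real verification engine would of course rely on richer checks (articulated body models, support-surface reasoning, or full physics simulation), but a clearance test of this kind conveys the contrast the article draws between a VLM's high-level semantic judgment and a low-level geometric verdict that can be turned into a training reward.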