Section 01
[Main Post/Introduction] Bottlenecks in Vision-Language Navigation: The Impact of 3D Scene Understanding Ability on Zero-Shot VLN Performance
This paper quantifies how much 3D scene understanding actually contributes to zero-shot Vision-Language Navigation (VLN) performance and reveals a perceptual saturation effect: once perceptual precision exceeds a threshold, further improvements yield sharply diminishing gains in navigation success rate. The study argues that 3D understanding for VLN should shift its focus from pixel-level precision to navigation-relevant core semantics and bounding box proportions, which offers a new direction for designing more efficient navigation systems.
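To make the idea of perceptual saturation concrete, here is a minimal sketch of how one might locate such a threshold from a series of (perceptual precision, success rate) measurements. Everything in it is an assumption for illustration: the function name `saturation_threshold`, the `drop_ratio` parameter, and the numbers are hypothetical placeholders, not the paper's method or results.

```python
from typing import Optional, Sequence


def saturation_threshold(
    precision: Sequence[float],
    success: Sequence[float],
    drop_ratio: float = 0.25,
) -> Optional[float]:
    """Return the precision level after which the marginal gain in success
    rate falls below `drop_ratio` times the largest marginal gain seen so far.

    Returns None if no such drop occurs. This is a hypothetical heuristic,
    not a definition taken from the paper.
    """
    if len(precision) != len(success) or len(precision) < 3:
        raise ValueError("need at least three paired measurements")

    best_slope = 0.0
    for i in range(1, len(precision)):
        step = precision[i] - precision[i - 1]
        if step <= 0:
            raise ValueError("precision values must be strictly increasing")
        # Marginal gain: success-rate improvement per unit of added precision.
        slope = (success[i] - success[i - 1]) / step
        if best_slope > 0 and slope < drop_ratio * best_slope:
            # Gains collapsed on this step, so the previous level is the knee.
            return precision[i - 1]
        best_slope = max(best_slope, slope)
    return None


# Synthetic placeholder values, NOT results from the paper.
precision = [0.2, 0.4, 0.6, 0.8, 1.0]
success = [0.10, 0.30, 0.45, 0.47, 0.48]
threshold = saturation_threshold(precision, success)
```

With these made-up values the success rate climbs quickly up to a precision of 0.6 and barely moves afterwards, so the sketch reports 0.6 as the saturation point. The relative `drop_ratio` test is only one possible choice; a fixed minimum-gain cutoff would work similarly.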