Section 01
[Introduction] EgoPoint-Ground: A Breakthrough in Multimodal Visual Localization That Teaches AI to Understand Gesture Pointing
This article introduces EgoPoint-Ground, a new work on gesture-pointing understanding and visual localization from a first-person (egocentric) perspective. The work contributes the first large-scale multimodal dataset for this task (over 15,000 interaction samples) and proposes SV-CoT, a structured visual reasoning method that improves on the current best method by 11.7%, pushing visual localization from purely language-driven queries toward "language + gesture" multimodal understanding.
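To make the "language + gesture" input concrete, here is a minimal sketch of what one interaction sample in such a dataset might look like. The schema is a hypothetical illustration: field names like `query`, `fingertip_xy`, and `target_box` are assumptions for exposition, not EgoPoint-Ground's actual data format.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical schema for one egocentric "language + gesture" grounding sample.
# All field names are illustrative assumptions, not EgoPoint-Ground's real format.
@dataclass
class PointingGroundingSample:
    image_path: str                      # first-person (egocentric) RGB frame
    query: str                           # natural-language instruction
    fingertip_xy: Tuple[float, float]    # fingertip location in the frame (normalized)
    pointing_dir: Tuple[float, float]    # 2D pointing direction from the hand pose
    target_box: Tuple[float, float, float, float]  # ground truth (x1, y1, x2, y2), normalized

def point_in_box(sample: PointingGroundingSample, xy: Tuple[float, float]) -> bool:
    """Check whether a predicted point falls inside the annotated target box."""
    x1, y1, x2, y2 = sample.target_box
    x, y = xy
    return x1 <= x <= x2 and y1 <= y <= y2

# Usage: a sample where the language alone ("that one") is ambiguous,
# and the pointing gesture is what disambiguates the target.
sample = PointingGroundingSample(
    image_path="frames/kitchen_0421.jpg",
    query="hand me that one",
    fingertip_xy=(0.62, 0.55),
    pointing_dir=(0.30, -0.10),
    target_box=(0.70, 0.40, 0.85, 0.60),
)
print(point_in_box(sample, (0.78, 0.50)))  # True: prediction lands inside the target
```

The point of the sketch is the pairing itself: without the gesture fields, a query like "hand me that one" is unresolvable, which is exactly the gap a language-plus-gesture dataset is meant to close.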