Section 01
FindIt Benchmark: A New Tool for Evaluating Visual Localization Capabilities of Multimodal Large Models
FindIt is presented as the first comprehensive benchmark designed specifically to evaluate the promptable localization capabilities of general-purpose multimodal large language models (MLLMs). It covers four task categories: object detection, referring expression detection, instance-level detection, and video detection. Together, these tasks reveal the strengths and limitations of current models on structured visual tasks.
Original Authors and Source: the paper's author team (arXiv 2606.04282v1), published June 2, 2026. Original link: http://arxiv.org/abs/2606.04282v1