Section 01
[Introduction] DeViL: Breaking the Efficiency Bottleneck of Spatiotemporal Localization in Video Large Models via Detector Empowerment
DeViL proposes a "Detector Empowerment" architecture that offloads dense spatial localization from the multimodal large language model (MLLM) to a fully parallelizable detector. The system retains strong reasoning capabilities while running at 14.33 FPS and reaching an m_vIoU of 43.1%, which addresses the efficiency bottleneck of spatiotemporal localization in video large models.
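For readers unfamiliar with the accuracy figure, the following is a minimal sketch of how m_vIoU is commonly computed in spatio-temporal video grounding benchmarks. It assumes that convention applies here, since this section does not define the metric. Under that convention, the per-sample vIoU sums the box IoU over frames where prediction and ground truth overlap in time and divides by the number of frames in their temporal union. The names `box_iou`, `v_iou`, and `mean_v_iou` are illustrative and do not come from the DeViL code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU for one sample; both arguments map frame index -> box.

    Frames covered by only one of the two tubes contribute zero
    but still count in the denominator (the temporal union).
    """
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def mean_v_iou(samples: List[Tuple[Dict[int, Box], Dict[int, Box]]]) -> float:
    """m_vIoU: the average vIoU over (prediction, ground truth) pairs."""
    if not samples:
        return 0.0
    return sum(v_iou(p, g) for p, g in samples) / len(samples)
```

Because the denominator is the temporal union, a prediction with perfect boxes but the wrong time span is still penalized, so the metric reflects spatial and temporal accuracy together.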