Zing Forum


DeViL: Breaking the Efficiency Bottleneck of Spatiotemporal Localization in Video Large Models via Detector Empowerment

DeViL proposes an innovative "Detector Empowerment" architecture, offloading dense spatial localization tasks from multimodal large language models (MLLMs) to fully parallelizable detectors. It achieves real-time performance of 14.33 FPS and an m_vIoU accuracy of 43.1% while maintaining strong reasoning capabilities.

Tags: Video Large Models · Spatiotemporal Localization · Object Detection · Multimodal · MLLM · STVG · Efficient Inference
Published 2026-05-11 18:02 · Recent activity 2026-05-11 18:20 · Estimated read 5 min

Section 01

[Introduction] DeViL: Breaking the Efficiency Bottleneck of Spatiotemporal Localization in Video Large Models via Detector Empowerment

DeViL proposes an innovative "Detector Empowerment" architecture, offloading dense spatial localization tasks from multimodal large language models (MLLMs) to fully parallelizable detectors. It achieves real-time performance of 14.33 FPS and an m_vIoU accuracy of 43.1% while maintaining strong reasoning capabilities, effectively addressing the efficiency bottleneck of spatiotemporal localization in video large models.


Section 02

Project Background and Challenges

Multimodal large language models (MLLMs) are expanding to fine-grained spatiotemporal video grounding (STVG), but existing methods face efficiency bottlenecks:

  1. Direct grounding paradigm: decoding cost grows linearly with the query's time span;
  2. Candidate selection paradigm: relies on a costly candidate-construction process.

Both limit the feasibility of practical deployment.
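The scaling contrast behind these two paradigms can be made concrete with a toy model (illustrative numbers, not measurements from the paper): autoregressive coordinate decoding emits several text tokens per box per frame, so its step count grows with the span, while a batched detector's latency scales with the number of batches.

```python
# Toy illustration (not the paper's numbers): why autoregressive coordinate
# decoding scales with the queried span while a batched detector does not.

def autoregressive_decode_steps(num_frames, tokens_per_box=6):
    """Each frame's box is emitted token by token, so decoding steps
    grow linearly with the number of frames in the query span."""
    return num_frames * tokens_per_box

def detector_batches(num_frames, batch_size=32):
    """A parallelizable detector processes frames in batches, so latency
    scales with ceil(frames / batch_size) rather than with token count."""
    return -(-num_frames // batch_size)  # ceiling division

for frames in (32, 128, 512):
    print(frames, autoregressive_decode_steps(frames), detector_batches(frames))
```

The constants (`tokens_per_box`, `batch_size`) are made-up placeholders; only the linear-vs-batched growth pattern is the point.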

Section 03

Core Innovative Methods of DeViL

The core idea of DeViL is to offload spatial localization tasks to parallelizable detectors, including two major innovations:

  1. Reference Semantic Token Distillation: Distill queries into detector-compatible tokens to replace text embeddings, completing spatial localization in a single forward pass and avoiding recursive decoding overhead;
  2. Temporal Consistency Regularization: Match objects across frames, enforce temporal coherence, and ensure stable and continuous localization results for the same target.
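The two ideas above can be sketched in miniature. This is a hedged illustration with made-up names and tiny dimensions, not the authors' code: reference token distillation is, at its core, a learned map from the MLLM's hidden space into the detector's text-token space, and temporal consistency can take the form of a smoothness penalty on the matched boxes of one target across frames.

```python
# Hedged sketch (illustrative names and tiny shapes, not the authors' code).

import random

MLLM_DIM, DET_DIM, NUM_TOKENS = 16, 4, 2          # tiny dims for illustration
random.seed(0)
W = [[random.gauss(0, 0.01) for _ in range(DET_DIM * NUM_TOKENS)]
     for _ in range(MLLM_DIM)]                     # learned by distillation in practice

def distill_reference_tokens(query_hidden):
    """query_hidden: length-MLLM_DIM vector -> NUM_TOKENS detector-space tokens."""
    flat = [sum(q * w for q, w in zip(query_hidden, col))
            for col in zip(*W)]                    # linear projection
    return [flat[i * DET_DIM:(i + 1) * DET_DIM] for i in range(NUM_TOKENS)]

def temporal_consistency_penalty(boxes):
    """boxes: list of (x1, y1, x2, y2) for one matched target across frames;
    mean squared frame-to-frame change, a simple smoothness regularizer."""
    diffs = [(a - b) ** 2 for prev, cur in zip(boxes, boxes[1:])
             for a, b in zip(cur, prev)]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A stationary target incurs zero penalty, while a box that jumps between frames is penalized, which is the stability property the regularizer is meant to enforce.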

Section 04

Technical Implementation Details

DeViL is built on VideoLLaMA3 and GroundingDINO:

  • VideoLLaMA3 provides strong video understanding capabilities;
  • GroundingDINO provides efficient and accurate object detection.

This modular design allows flexible integration into different MLLM architectures, opening new possibilities for video understanding research and applications.
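The division of labor the section describes can be sketched as a hypothetical inference flow: the MLLM handles reasoning and temporal grounding, the detector handles per-frame boxes. All class and method names below are placeholders (with minimal stubs so the flow runs), not the released API.

```python
# Hypothetical sketch of the pipeline; names are placeholders, not the real API.

def devil_inference(frames, query, mllm, projector, detector):
    # 1. MLLM reads the video and query, returns a reference representation
    #    and the temporal span the query refers to.
    ref_hidden, (t0, t1) = mllm.ground(frames, query)
    # 2. Distill the reference into detector-compatible tokens
    #    (single pass, no recursive coordinate decoding).
    ref_tokens = projector(ref_hidden)
    # 3. Detector localizes the target across the span's frames in parallel.
    boxes = detector.batch_detect(frames[t0:t1], ref_tokens)
    return (t0, t1), boxes

# Minimal stubs so the flow can be exercised end to end.
class StubMLLM:
    def ground(self, frames, query):
        return [0.0] * 8, (1, 4)            # fake hidden state and span

class StubDetector:
    def batch_detect(self, frames, ref_tokens):
        return [(10, 10, 50, 50) for _ in frames]  # one box per frame

span, boxes = devil_inference(list(range(6)), "the person in red",
                              StubMLLM(), lambda h: h, StubDetector())
print(span, len(boxes))
```

Because step 3 receives the whole frame span at once, swapping in a different MLLM or detector only requires changing the stub-shaped interfaces, which is the modularity argument made above.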

Section 05

Performance and Experimental Results

On the HC-STVG benchmark, DeViL achieved remarkable results:

  • Accuracy: 43.1% m_vIoU;
  • Efficiency: 14.33 FPS.

These results show that DeViL avoids lengthy coordinate decoding and heavy candidate pipelines while preserving the MLLM's general reasoning capabilities, delivering gains in both accuracy and efficiency.
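For readers unfamiliar with the metric, here is a sketch of vIoU as it is commonly defined for spatiotemporal grounding (the standard formulation, not code from the paper): per-frame box IoU is summed over the frames where the predicted and ground-truth temporal segments overlap, then normalized by the union of the two segments; m_vIoU averages this over the test set.

```python
# Sketch of the vIoU metric as commonly defined for STVG benchmarks.

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def viou(pred_boxes, gt_boxes):
    """pred_boxes, gt_boxes: dicts {frame_index: box} covering the predicted
    and ground-truth temporal segments of one sample."""
    inter_frames = pred_boxes.keys() & gt_boxes.keys()
    union_frames = pred_boxes.keys() | gt_boxes.keys()
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames)
    return total / len(union_frames)
```

Note that mispredicting the temporal span hurts vIoU even when the per-frame boxes are perfect, since the union of frames grows, which is why the metric rewards joint spatial and temporal accuracy.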

Section 06

Application Scenarios and Significance

DeViL's efficient spatiotemporal localization capabilities empower multiple scenarios:

  • Intelligent surveillance: Real-time localization and analysis of specific events/objects;
  • Autonomous driving: Fast identification and tracking of key road targets;
  • Video content analysis: Providing precise spatiotemporal information for retrieval and summarization;
  • Human-computer interaction: Supporting video content query and localization via natural language descriptions.

Section 07

Summary and Outlook

DeViL addresses the efficiency bottleneck of spatiotemporal localization in video large models through the "Detector Empowerment" architecture. Its idea of "offloading specific tasks to lightweight modules" provides a reference for the efficient expansion of MLLMs. As video content grows, solutions that balance accuracy and efficiency become increasingly important, and the open-source nature of this project also provides a valuable reference implementation for the community.