# DeViL: Breaking the Efficiency Bottleneck of Spatiotemporal Localization in Video Large Models via Detector Empowerment

> DeViL proposes an innovative "Detector Empowerment" architecture, offloading dense spatial localization tasks from multimodal large language models (MLLMs) to fully parallelizable detectors. It achieves real-time performance of 14.33 FPS and an m_vIoU accuracy of 43.1% while maintaining strong reasoning capabilities.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-11T10:02:17.000Z
- Last activity: 2026-05-11T10:20:28.339Z
- Heat: 148.7
- Keywords: video large models, spatiotemporal localization, object detection, multimodal, MLLM, STVG, efficient inference
- Page URL: https://www.zingnex.cn/en/forum/thread/devil
- Canonical: https://www.zingnex.cn/forum/thread/devil

---

## Introduction

DeViL proposes an innovative "Detector Empowerment" architecture that offloads dense spatial localization from multimodal large language models (MLLMs) to fully parallelizable detectors. It reaches 14.33 FPS at 43.1% m_vIoU while maintaining strong reasoning capabilities, directly addressing the efficiency bottleneck of spatiotemporal localization in video large models.

## Project Background and Challenges

Multimodal large language models (MLLMs) are expanding toward fine-grained spatiotemporal video grounding (STVG), but existing methods face efficiency bottlenecks (a toy cost comparison follows this list):
1. **Direct grounding paradigm**: decoding cost grows linearly with the queried time span, because box coordinates must be decoded autoregressively for every frame;
2. **Candidate selection paradigm**: relies on a costly candidate-construction pipeline before selection can happen.

Both limitations restrict practical deployment.
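
To make the scaling argument concrete, here is a minimal back-of-the-envelope sketch; the token count per box is an illustrative assumption, not a figure from the post:

```python
# Toy cost model (illustrative assumptions, not DeViL's actual numbers):
# - direct grounding: the LLM autoregressively decodes coordinate tokens
#   per frame, so the number of sequential steps grows with span length;
# - detector empowerment: one batched detector pass covers all frames,
#   so sequential depth stays constant.
TOKENS_PER_BOX = 6  # assumed tokens needed to serialize one bounding box

def sequential_steps_direct(num_frames: int) -> int:
    """Autoregressive decoding: one sequential step per coordinate token."""
    return num_frames * TOKENS_PER_BOX

def sequential_steps_detector(num_frames: int) -> int:
    """Frames are processed in parallel by the detector: depth is O(1)."""
    return 1

for t in (16, 64, 256):
    print(f"{t} frames: direct={sequential_steps_direct(t)} steps, "
          f"detector={sequential_steps_detector(t)} step(s)")
```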

## Core Innovative Methods of DeViL

The core idea of DeViL is to offload spatial localization to parallelizable detectors through two innovations (a combined sketch follows this list):
1. **Reference Semantic Token Distillation**: distills the query into detector-compatible tokens that replace the detector's text embeddings, so spatial localization completes in a single forward pass and avoids recursive decoding overhead;
2. **Temporal Consistency Regularization**: matches objects across frames and enforces temporal coherence, keeping localization of the same target stable and continuous.
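
The following PyTorch sketch illustrates both components under stated assumptions: the projector architecture, dimensions, and the smoothness-based regularizer are illustrative guesses, not DeViL's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceTokenProjector(nn.Module):
    """Hypothetical distillation head: maps a pooled MLLM hidden state for
    the query into tokens in the detector's text-embedding space
    (dimensions and layer choices are assumptions)."""
    def __init__(self, llm_dim: int = 4096, det_dim: int = 256, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, det_dim),
            nn.GELU(),
            nn.Linear(det_dim, det_dim * num_tokens),
        )

    def forward(self, query_hidden: torch.Tensor) -> torch.Tensor:
        # query_hidden: (B, llm_dim) pooled hidden state of the query
        batch = query_hidden.shape[0]
        return self.proj(query_hidden).view(batch, self.num_tokens, -1)

def temporal_consistency_loss(boxes: torch.Tensor) -> torch.Tensor:
    """Toy regularizer: penalize large jumps of the matched box between
    adjacent frames (boxes in normalized cxcywh format, shape (T, 4))."""
    return F.smooth_l1_loss(boxes[1:], boxes[:-1])

# Usage sketch: distilled tokens stand in for the detector's text
# embeddings, and per-frame boxes are regularized for coherence.
projector = ReferenceTokenProjector()
ref_tokens = projector(torch.randn(2, 4096))  # (2, 4, 256)
boxes = torch.rand(16, 4)                     # 16 frames, one target
print(ref_tokens.shape, temporal_consistency_loss(boxes).item())
```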

## Technical Implementation Details

DeViL is built on VideoLLaMA3 and GroundingDINO:
- VideoLLaMA3 supplies strong video understanding;
- GroundingDINO supplies efficient, accurate open-vocabulary object detection.

The modular design allows flexible integration into different MLLM architectures, opening new possibilities for video understanding research and applications.
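
To make the division of labor concrete, here is a hypothetical wiring of the pipeline. The class names and method signatures are stand-ins, not the real VideoLLaMA3 or GroundingDINO APIs:

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Box:
    frame: int
    cxcywh: tuple  # normalized center-x, center-y, width, height

class DummyMLLM:
    """Stand-in for VideoLLaMA3 (hypothetical API): reasons over the video
    and query, then pools the query into a fixed-size hidden vector."""
    def encode_query(self, frames: Sequence, query: str) -> List[float]:
        return [random.random() for _ in range(8)]

class DummyDetector:
    """Stand-in for GroundingDINO (hypothetical API): localizes the target
    in one frame, conditioned on distilled reference tokens."""
    def detect(self, frame, ref_tokens: List[float]) -> tuple:
        return (0.5, 0.5, 0.2, 0.3)

def ground(frames: Sequence, query: str, mllm, projector: Callable, detector) -> List[Box]:
    hidden = mllm.encode_query(frames, query)  # single MLLM pass: reasoning
    ref_tokens = projector(hidden)             # distill into detector space
    # Every frame can go through the detector in parallel; a loop stands in
    # for a batched forward pass here.
    return [Box(i, detector.detect(f, ref_tokens)) for i, f in enumerate(frames)]

tube = ground(frames=[object()] * 4, query="the man in red",
              mllm=DummyMLLM(), projector=lambda h: h, detector=DummyDetector())
print(tube[0])
```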

## Performance and Experimental Results

On the HC-STVG benchmark, DeViL delivers strong results:
- Accuracy: 43.1% m_vIoU;
- Efficiency: 14.33 FPS.

These results show that DeViL avoids long coordinate decoding and heavy candidate pipelines while preserving the MLLM's general reasoning capabilities, improving both accuracy and efficiency. A sketch of the metric follows.
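
For reference, m_vIoU is commonly defined in the STVG literature (an assumption here, not stated in this post) as the mean over queries of vIoU: the per-frame spatial IoU summed over the temporal intersection of the predicted and ground-truth tubes, normalized by their temporal union. A minimal sketch under that assumed definition:

```python
def iou(a, b):
    """Spatial IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def v_iou(pred_tube, gt_tube):
    """pred_tube/gt_tube: dicts mapping frame index -> box.
    Per-frame IoU is summed over the temporal intersection and
    normalized by the size of the temporal union."""
    inter_frames = pred_tube.keys() & gt_tube.keys()
    union_frames = pred_tube.keys() | gt_tube.keys()
    if not union_frames:
        return 0.0
    total = sum(iou(pred_tube[t], gt_tube[t]) for t in inter_frames)
    return total / len(union_frames)

def m_v_iou(pairs):
    """Mean vIoU over a list of (pred_tube, gt_tube) pairs."""
    return sum(v_iou(p, g) for p, g in pairs) / len(pairs)

# Tiny example: perfect overlap on one frame, but the prediction misses
# a second ground-truth frame, so vIoU = 1.0 / 2 = 0.5.
pred = {0: (0, 0, 10, 10)}
gt = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
print(v_iou(pred, gt))
```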

## Application Scenarios and Significance

DeViL's efficient spatiotemporal localization enables a range of scenarios:
- Intelligent surveillance: Real-time localization and analysis of specific events/objects;
- Autonomous driving: Fast identification and tracking of key road targets;
- Video content analysis: Providing precise spatiotemporal information for retrieval and summarization;
- Human-computer interaction: Supporting video content query and localization via natural language descriptions.

## Summary and Outlook

DeViL addresses the efficiency bottleneck of spatiotemporal localization in video large models through its "Detector Empowerment" architecture. Its idea of offloading specific tasks to lightweight, parallelizable modules offers a template for extending MLLMs efficiently. As video content grows, solutions that balance accuracy and efficiency become increasingly important, and the project's open-source release also gives the community a valuable reference implementation.
