Section 01
Introduction: SlotVTG Framework Addresses the Cross-Domain Generalization Challenge of MLLMs in Video Temporal Grounding
The SlotVTG framework addresses the cross-domain generalization challenge that multimodal large language models (MLLMs) face in video temporal grounding (VTG) by adding a lightweight object-centric adapter. The adapter guides the MLLM to perform object-level visual reasoning without retraining the entire model, significantly improving generalization to out-of-domain data.