Section 01
BoxTuning: A New Paradigm for Reshaping Object Understanding in Video Multimodal Models
BoxTuning is a visual prompting method that renders colored bounding boxes and motion trajectories directly onto video frames, instead of passing object locations to the model as text coordinates. This addresses the modality mismatch of the traditional text-coordinate paradigm, in which inherently visual object locations are described in text. The method reduces text tokens by 87-93% while maintaining full temporal resolution, and it outperforms existing baselines on five video question-answering benchmarks.
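To make the idea of rendering prompts onto frames concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes frames are nested lists of RGB tuples and that each object has one box per frame. The names `render_prompts`, `draw_box`, and `PALETTE`, the one-pixel outline, and the choice to mark a trajectory as the centers of past boxes are all illustrative assumptions; the source does not specify colors, line widths, or how trajectories are drawn.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]
Frame = List[List[Color]]        # frame[y][x] -> (r, g, b)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel coordinates

# Hypothetical palette: one color per object ID (an assumption, not from the source).
PALETTE: List[Color] = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]


def blank_frame(width: int, height: int, fill: Color = (0, 0, 0)) -> Frame:
    """Create a frame filled with a single color."""
    return [[fill for _ in range(width)] for _ in range(height)]


def draw_box(frame: Frame, box: Box, color: Color) -> None:
    """Draw a one-pixel box outline in place, clamped to the frame borders."""
    height, width = len(frame), len(frame[0])
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width - 1, x1)
    y0, y1 = max(0, y0), min(height - 1, y1)
    if x0 > x1 or y0 > y1:
        return  # box lies entirely outside the frame
    for x in range(x0, x1 + 1):
        frame[y0][x] = color
        frame[y1][x] = color
    for y in range(y0, y1 + 1):
        frame[y][x0] = color
        frame[y][x1] = color


def box_center(box: Box) -> Tuple[int, int]:
    """Return the integer center (x, y) of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def render_prompts(frames: List[Frame], tracks: Dict[int, List[Box]]) -> List[Frame]:
    """Return copies of the frames with each object's box and past centers drawn."""
    out = [[row[:] for row in frame] for frame in frames]
    for obj_id, boxes in sorted(tracks.items()):
        color = PALETTE[obj_id % len(PALETTE)]
        for t, frame in enumerate(out):
            draw_box(frame, boxes[t], color)
            # Trajectory: mark the box centers from frame 0 up to frame t.
            for past in boxes[: t + 1]:
                cx, cy = box_center(past)
                if 0 <= cy < len(frame) and 0 <= cx < len(frame[0]):
                    frame[cy][cx] = color
    return out
```

With this sketch, the object's location and motion history live in the pixels of every frame, so a prompt only needs to refer to an object by its color instead of listing coordinates per frame. A real pipeline would more likely draw with an image library and anti-aliased lines; the nested-list representation is used here only to keep the example dependency-free.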