Section 01
Introduction: InstructVideo: A Reasoning-Driven Video Object Segmentation Dataset for Multimodal Large Language Models
InstructVideo is a reasoning-centric video object segmentation dataset designed for multimodal large language models. It contains 1,788 videos, 6,112 question-answer pairs, and 3,603 object annotations. Its tasks call for complex reasoning, so a model needs both world knowledge and temporal understanding to solve them.
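To put these counts in proportion, the short sketch below derives a few average densities from the three figures quoted above. It is only illustrative arithmetic: the constant and function names are made up for this example, and the averages assume that every question-answer pair and every object annotation belongs to exactly one video, which the text does not state.

```python
# Counts as quoted in the text above.
NUM_VIDEOS = 1788
NUM_QA_PAIRS = 6112
NUM_OBJECTS = 3603


def per_unit(total: int, units: int) -> float:
    """Average number of items per unit, rounded to two decimals."""
    return round(total / units, 2)


# Assumption: each QA pair and each object annotation maps to one video.
qa_per_video = per_unit(NUM_QA_PAIRS, NUM_VIDEOS)
objects_per_video = per_unit(NUM_OBJECTS, NUM_VIDEOS)
qa_per_object = per_unit(NUM_QA_PAIRS, NUM_OBJECTS)

print(f"QA pairs per video:   {qa_per_video}")
print(f"Objects per video:    {objects_per_video}")
print(f"QA pairs per object:  {qa_per_object}")
```

Under that assumption, a video carries on average about 3.4 question-answer pairs and about 2 annotated objects.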