Zing Forum

InstructVideo: A Reasoning-Driven Video Object Segmentation Dataset for Multimodal Large Language Models

InstructVideo is a reasoning-centric video object segmentation dataset designed specifically for multimodal large language models. It contains 1,788 videos, 6,112 question-answer pairs, and 3,603 object annotations. Its tasks call for complex reasoning, so models need both world knowledge and temporal understanding to complete them.

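Video understanding · Multimodal large language models · Object segmentation · Datasets · Reasoning · Temporal understanding · Computer vision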
Published 2026-06-07 18:01 · Recent activity 2026-06-07 18:23 · Estimated read 7 min

Section 02

Original Authors and Source

Section 03

Background: Challenges in Video Understanding

Video understanding is one of the most challenging tasks in computer vision. Unlike static images, videos add a temporal dimension: a model must understand the content of each frame and also capture the relationships between frames, how actions evolve over time, and the motion trajectories of objects.

Traditional Video Object Segmentation (VOS) datasets mainly focus on pixel-level mask prediction, a relatively simple task format. With the rise of Multimodal Large Language Models (MLLMs), however, the research community has begun to explore more demanding video understanding tasks. These require models not only to segment target objects but also to understand complex instructions, perform multi-step reasoning, and provide logically consistent textual answers.

InstructVideo was created to fill this research gap.

Section 04

Dataset Overview

InstructVideo is a reasoning-centric video object segmentation dataset built to evaluate and advance research on multimodal large language models in complex video understanding tasks. Unlike existing datasets, InstructVideo emphasizes reasoning: models need world knowledge and temporal understanding to complete its tasks correctly.

Section 05

Core Statistics

  • Number of Videos: 1,788 video clips
  • Question-Answer Pairs: 6,112 QA pairs
  • Number of Objects: 3,603 target objects
  • Average Instances per Multi-Object Sample: 3.77
  • Maximum Instances per Sample: 16

These statistics indicate that InstructVideo is not only substantial in scale but also particularly focused on the complexity of multi-object scenarios, which is a common challenge in real-world video understanding.
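
To put the headline counts in proportion, the short sketch below derives per-video averages from the three totals quoted above. Only the totals come from the dataset description; the helper name `per_video` is made up for illustration, and the results are plain averages that say nothing about how QA pairs or objects are actually distributed across videos.

```python
# Totals quoted in the statistics above.
NUM_VIDEOS = 1_788
NUM_QA_PAIRS = 6_112
NUM_OBJECTS = 3_603


def per_video(total: int, videos: int = NUM_VIDEOS) -> float:
    """Average count per video, rounded to two decimals (hypothetical helper)."""
    return round(total / videos, 2)


qa_per_video = per_video(NUM_QA_PAIRS)
objects_per_video = per_video(NUM_OBJECTS)

print(f"QA pairs per video: {qa_per_video}")                # about 3.42
print(f"Annotated objects per video: {objects_per_video}")  # about 2.02
```

On average, then, each video carries a little over three questions and about two annotated objects.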

Section 06

Reasoning-Centric Query Design

The most prominent feature of InstructVideo is its reasoning-centric query design. Traditional VOS datasets usually use simple descriptive instructions, such as "Segment the red car". In contrast, InstructVideo's queries require models to perform multi-step reasoning, for example:

  • "Find the boy who fell after chasing the ball"
  • "Segment the person who first picked up the book and then walked to the window"
  • "Which object disappears in the second half of the video?"

Such queries require models to understand high-level semantic information like action sequences, causal relationships, and temporal order, rather than just pixel-level matching.
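
One way to picture such a query is as a small record that ties a question to the objects it refers to. The sketch below is purely illustrative: the class `ReasoningQuery`, its field names, the video id, and the reference answer are all assumptions and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningQuery:
    """Hypothetical record for one question-answer pair (not the official schema)."""
    video_id: str
    question: str
    target_object_ids: list[int]   # objects the model is expected to segment
    answer_text: str               # reference textual answer
    reasoning_cues: list[str] = field(default_factory=list)  # e.g. "temporal order"

    @property
    def is_multi_object(self) -> bool:
        """True when the query refers to more than one object."""
        return len(self.target_object_ids) > 1


# Example built from one of the queries listed above; all values besides the
# question text are invented placeholders.
sample = ReasoningQuery(
    video_id="video_0001",
    question="Segment the person who first picked up the book and then walked to the window",
    target_object_ids=[2],
    answer_text="Placeholder answer describing who picked up the book before walking to the window.",
    reasoning_cues=["action sequence", "temporal order"],
)

print(sample.is_multi_object)
```

Keeping the reasoning cues alongside the question makes it easy to slice results by the kind of reasoning a query demands.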

Section 07

Balance Between Single-Object and Multi-Object Tasks

The dataset includes both single-object and multi-object segmentation tasks. Multi-object scenarios are particularly challenging because models must:

  • Distinguish between multiple similar objects (e.g., a specific person in a crowd)
  • Track interaction relationships between multiple objects
  • Handle complex situations like occlusion and overlap

InstructVideo's multi-object samples contain an average of 3.77 instances, with a maximum of 16, providing rich test scenarios for research on multi-object reasoning.
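
The average and maximum quoted here are computed over multi-object samples only. A minimal sketch of that calculation, assuming each sample is summarized by its instance count, might look as follows; the function name and the toy counts are invented and are not taken from the dataset.

```python
from statistics import mean


def multi_object_stats(instance_counts: list[int]) -> tuple[float, int]:
    """Return (average, maximum) instance count over samples with more than one instance."""
    multi = [n for n in instance_counts if n > 1]
    if not multi:
        raise ValueError("no multi-object samples in the input")
    return mean(multi), max(multi)


# Invented toy data: instance counts for seven samples, three of them single-object.
toy_counts = [1, 2, 1, 4, 3, 1, 6]
avg_instances, max_instances = multi_object_stats(toy_counts)

print(avg_instances, max_instances)  # 3.75 6
```

Single-object samples are filtered out first, which is why the average can sit well above the overall objects-per-video ratio.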

Section 08

Logical Textual Answers

Unlike traditional datasets that only require mask prediction, InstructVideo requires models to provide logical textual answers. This means models not only need to "see" the correct object but also "understand" the intent of the question and explain their reasoning process in natural language. This design is closer to how humans understand videos and provides a new dimension for evaluating the interpretability of MLLMs.
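
In practice, this means a model's output has two parts: masks and text. The sketch below shows one hypothetical way to represent such an output and to check that neither part is missing. The `Prediction` class, the nested-list mask layout, and the `is_complete` check are assumptions for illustration, not the dataset's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical model output: one mask track per object plus a textual answer."""
    masks: dict[int, list[list[list[int]]]]  # object id -> frames -> rows -> 0/1 pixels
    answer_text: str


def is_complete(pred: Prediction, expected_object_ids: set[int]) -> bool:
    """True if every expected object has a non-empty mask track and the answer is not blank."""
    has_all_masks = all(pred.masks.get(obj_id) for obj_id in expected_object_ids)
    return has_all_masks and bool(pred.answer_text.strip())


# A single 2x2 frame for object 1, with a placeholder answer.
full = Prediction(masks={1: [[[0, 1], [1, 1]]]}, answer_text="Placeholder explanation of the choice.")
mask_only = Prediction(masks={1: [[[0, 1], [1, 1]]]}, answer_text="   ")

print(is_complete(full, {1}))       # True
print(is_complete(mask_only, {1}))  # False
```

A check like this only confirms that both parts are present; judging whether the text is logically consistent with the masks would need a separate evaluation step.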