SAMA: A Multi-turn Dialogue Framework for LLMs to Truly Understand Videos and Precisely Locate Objects

The SAMA framework, open-sourced by a Fudan University team, is the first to unify video referential understanding and visual localization in a single multi-turn dialogue task. It was published at NeurIPS 2025, and its 239,000 training samples and complete code are open source.

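Tags: SAMA, Video Large Language Model, Video Referential Understanding, Video Localization, Multi-turn Dialogue, NeurIPS 2025, Fudan University, Segment Anything, Video Segmentation, Multimodal AI
Published 2026-05-20 16:44 | Recent activity 2026-05-20 16:48 | Estimated read 5 min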

Section 01

Introduction: SAMA Framework—A Breakthrough in Video Large Language Models

The SAMA framework, published by the Fudan University team at NeurIPS 2025, is the first to unify video referential understanding and visual localization in a single multi-turn dialogue task. With 239,000 training samples and complete open-source code, it offers a new approach to the challenges of video understanding.


Section 02

Core Challenges in Video Understanding

Current video large language models face two core challenges: video referential understanding (comprehending the semantics of the specific regions or objects a user refers to) and video localization (precisely segmenting objects based on a description). Existing methods mostly handle these two tasks separately, which limits how far such models can develop into multimodal intelligent assistants.
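
To make the two task types concrete, the sketch below models one multi-turn exchange in which a referential question and a localization request share the same dialogue. All class and field names (`Turn`, `Dialogue`, `region`, `mask_id`) are hypothetical and chosen for illustration; this is not the SAMA data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; not the SAMA data format.
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Turn:
    question: str
    answer: str
    # Referential understanding: the user points at a region in one frame.
    frame_index: Optional[int] = None
    region: Optional[Box] = None
    # Localization: the answer is tied to a segmentation mask identifier.
    mask_id: Optional[str] = None

    @property
    def task(self) -> str:
        """Classify the turn by which optional fields are filled in."""
        if self.region is not None:
            return "referential"
        if self.mask_id is not None:
            return "localization"
        return "chat"


@dataclass
class Dialogue:
    video_id: str
    turns: List[Turn] = field(default_factory=list)

    def tasks(self) -> List[str]:
        return [turn.task for turn in self.turns]


dialogue = Dialogue(
    video_id="example_video",
    turns=[
        Turn("What is the object in this box doing?",
             "It is a dog chasing a ball.",
             frame_index=12, region=(40, 60, 180, 220)),
        Turn("Segment the ball it is chasing.",
             "Here is the ball.",
             mask_id="mask_0"),
    ],
)
print(dialogue.tasks())  # ['referential', 'localization']
```

The second turn only makes sense given the first ("the ball it is chasing"), which is the kind of cross-turn dependency that separate single-task models cannot express.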


Section 03

SAMA's Three-in-One Innovative Solution

SAMA addresses the problem on three fronts, namely dataset, model architecture, and evaluation benchmark:

  1. SAMA-239K Dataset: Integrates 239,000 samples from 15,000 videos, supporting joint learning of referential understanding, localization, and multi-turn dialogue;
  2. Model Architecture: Combines a spatiotemporal context aggregator (which tracks object trajectories and associates objects across frames) with the Segment Anything Model (which supplies zero-shot segmentation capability); weights are open-sourced at 1B, 4B, and 8B scales;
  3. SAMA-Bench Benchmark: 5,067 questions across 522 videos, providing a unified evaluation standard.
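
As a rough sense of scale, the figures above imply the following annotation density per video. This is plain arithmetic on the numbers quoted in the list; the training figures are rounded in the article, so the result is approximate, and the function name is only illustrative.

```python
def items_per_video(total_items: int, total_videos: int) -> float:
    """Average number of items (samples or questions) per video."""
    return total_items / total_videos


# Figures quoted in the list above (training numbers are rounded).
train_density = items_per_video(239_000, 15_000)
bench_density = items_per_video(5_067, 522)

print(f"SAMA-239K: {train_density:.1f} samples per video")     # 15.9
print(f"SAMA-Bench: {bench_density:.1f} questions per video")  # 9.7
```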

Section 04

Experimental Results: Multiple SOTAs and Strong Generalization Capability

SAMA leads on multiple benchmarks:

  • Significantly outperforms existing methods on SAMA-Bench;
  • Achieves new state-of-the-art (SOTA) results on general video localization benchmarks (e.g., Ref-DAVIS, Ref-YouTube-VOS);
  • Maintains competitiveness on standard visual understanding benchmarks and shows robust generalization on unseen video types.
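
Referring video segmentation benchmarks such as Ref-DAVIS and Ref-YouTube-VOS are commonly scored with region similarity J (mask intersection-over-union) together with contour accuracy F. The article does not say which metrics SAMA reports, so the following is only a minimal sketch of J on plain Python lists, not the benchmarks' official evaluation code.

```python
from typing import List, Sequence

Mask = List[List[int]]  # binary mask: rows of 0/1 values


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_region_similarity(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-frame IoU over all frames of one video."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy 2x2 frames: the first prediction over-segments by one pixel.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
print(mean_region_similarity(preds, targets))  # 0.75
```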

Section 05

Technical Implementation Details

  • Environment Configuration: Based on PyTorch 2.3.1, CUDA 12.1, and mmcv;
  • Training Strategy: Distributed training on eight A100 (80 GB) GPUs, supporting all three model scales, with weight conversion scripts provided;
  • Inference Support: Provides complete evaluation scripts for image and video segmentation tasks, lowering the barrier to reproduction.
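
Before attempting a reproduction, it can help to confirm that the local environment matches the versions listed above. The sketch below is not part of the SAMA repository; it assumes the packages are installed under the distribution names `torch` and `mmcv`, and the helper names are hypothetical.

```python
from importlib import metadata
from typing import Dict, Optional

# PyTorch version named in the article. The article gives no mmcv version,
# so mmcv is only reported, not compared.
EXPECTED_TORCH = "2.3.1"
EXPECTED_CUDA = "12.1"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed: Optional[str], expected: str) -> bool:
    """Compare versions while ignoring a local suffix such as '+cu121'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == expected


def cuda_build_version() -> Optional[str]:
    """CUDA version PyTorch was built against, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


def report() -> Dict[str, object]:
    """Collect the checks into one dictionary for printing or logging."""
    torch_version = installed_version("torch")
    return {
        "torch": torch_version,
        "torch_ok": version_matches(torch_version, EXPECTED_TORCH),
        "cuda": cuda_build_version(),
        "cuda_ok": cuda_build_version() == EXPECTED_CUDA,
        "mmcv": installed_version("mmcv"),
    }
```

Calling `print(report())` in the target environment shows at a glance which of the listed requirements are missing or mismatched.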

Section 06

Application Prospects and Significance

  • Academic: Unifies the fields of video referential understanding and localization, encouraging research that spans both;
  • Industrial: Multi-turn dialogue capability can be applied to scenarios like intelligent monitoring, video review, and educational assistance;
  • Open-source Ecosystem: Complete data, code, and models are open-sourced, accelerating the development of the field.

Section 07

Conclusion: An Important Step Toward Practical Video Large Models

SAMA delivers a technical advance and shows the value of pairing academic research with engineering. As video makes up a growing share of content, technologies that understand videos in depth and interact naturally will play a key role in AI applications, and SAMA gives researchers and developers a strong starting point for further exploration.