# SAMA: A Multi-turn Dialogue Framework for LLMs to Truly Understand Videos and Precisely Locate Objects

> SAMA, an open-source framework from a Fudan University team, is the first to unify video referential understanding and visual localization in a single multi-turn dialogue task. It was published at NeurIPS 2025, and both the 239,000-sample training set and the complete code are openly available.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-20T08:44:58.000Z
- Last activity: 2026-05-20T08:48:33.751Z
- Popularity: 154.9
- Keywords: SAMA, video large language models, video referential understanding, video localization, multi-turn dialogue, NeurIPS 2025, Fudan University, Segment Anything, video segmentation, multimodal AI
- Page link: https://www.zingnex.cn/en/forum/thread/sama-31e482f4
- Canonical: https://www.zingnex.cn/forum/thread/sama-31e482f4

---

## Introduction: The SAMA Framework, a Breakthrough in Video Large Language Models

Published at NeurIPS 2025 by a Fudan University team, SAMA is the first framework to unify video referential understanding and visual localization in a single multi-turn dialogue task. The team has open-sourced the 239,000-sample training set and the complete code, offering a new approach to fine-grained video understanding.

## Core Challenges in Video Understanding

Current video large language models face two core challenges: video referential understanding (grasping the semantics of the specific regions or objects a user points to) and video localization (precisely segmenting objects from a textual description). Most existing methods handle the two tasks separately, which limits how far these models can develop into general multimodal assistants.

## SAMA's Three-in-One Innovative Solution

SAMA addresses the problem on three fronts: dataset, model architecture, and evaluation benchmark.
1. **SAMA-239K Dataset**: Collects 239,000 samples from 15,000 videos and supports joint learning of referential understanding, localization, and multi-turn dialogue;
2. **Model Architecture**: Combines a spatiotemporal context aggregator, which tracks object trajectories and associates objects across frames, with the Segment Anything Model, which contributes zero-shot segmentation capability. Weights are open-sourced at 1B, 4B, and 8B scales;
3. **SAMA-Bench Benchmark**: Contains 5,067 questions across 522 videos and provides a unified evaluation standard.
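To make the idea of one conversation covering both tasks more concrete, here is a minimal sketch of what a single multi-turn sample could look like. The schema is purely illustrative: the class names and fields (`Turn`, `DialogueSample`, `region`, `expects_mask`) are hypothetical and are not taken from the SAMA-239K release, whose actual format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical schema for illustration only; the real SAMA-239K format may differ.
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Turn:
    role: str                          # "user" or "assistant"
    text: str
    frame_index: Optional[int] = None  # frame that a user region prompt refers to
    region: Optional[Box] = None       # user-specified region (referential understanding)
    expects_mask: bool = False         # True when the reply should come with a segmentation mask


@dataclass
class DialogueSample:
    video_id: str
    turns: List[Turn] = field(default_factory=list)

    def task_types(self) -> Set[str]:
        """Return which of the two tasks this dialogue exercises."""
        tasks: Set[str] = set()
        for turn in self.turns:
            if turn.role == "user" and turn.region is not None:
                tasks.add("referring")
            if turn.role == "assistant" and turn.expects_mask:
                tasks.add("grounding")
        return tasks


sample = DialogueSample(
    video_id="demo_0001",
    turns=[
        Turn("user", "What is the object in this region doing?",
             frame_index=0, region=(40, 60, 120, 200)),
        Turn("assistant", "A dog is chasing a ball across the lawn."),
        Turn("user", "Segment the ball it is chasing."),
        Turn("assistant", "Here is the ball.", expects_mask=True),
    ],
)
print(sorted(sample.task_types()))  # ['grounding', 'referring']
```

The point of the sketch is that a region-based question and a segmentation request can live in the same dialogue, so later turns can rely on context established by earlier ones.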

## Experimental Results: Multiple SOTAs and Strong Generalization Capability

SAMA reports leading results on several benchmarks:
- Significantly outperforms existing methods on SAMA-Bench;
- Sets a new state of the art on general video localization benchmarks (e.g., Ref-DAVIS, Ref-Youtube-VOS);
- Remains competitive on standard visual understanding benchmarks and generalizes robustly to unseen video types.
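The post does not say which metrics were used. Referring video segmentation benchmarks such as Ref-DAVIS and Ref-Youtube-VOS commonly report region similarity J (the intersection over union of predicted and ground-truth masks) together with a contour accuracy score F. As a rough illustration of the J part only, and not of SAMA's actual evaluation scripts, a per-video score might be computed like this; the function names are hypothetical.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_region_similarity(preds: List[Mask], gts: List[Mask]) -> float:
    """Average per-frame IoU over one video (a J score for a single object)."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need the same non-zero number of predicted and ground-truth frames")
    return sum(mask_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Two tiny 2x2 frames: IoU is 1/2 on the first frame and 1/1 on the second.
pred_frames = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gt_frames = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
print(mean_region_similarity(pred_frames, gt_frames))  # 0.75
```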

## Technical Implementation Details

- **Environment Configuration**: Built on PyTorch 2.3.1, CUDA 12.1, and mmcv;
- **Training Strategy**: Distributed training on eight A100 (80 GB) GPUs, with support for all three model scales and weight conversion scripts provided;
- **Inference Support**: Complete evaluation scripts for image and video segmentation tasks are provided, lowering the barrier to reproduction.
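Before attempting a reproduction, it can help to confirm that the local environment matches the versions listed above. The following is a minimal sketch using only the Python standard library; it is not part of the SAMA repository. It assumes PyTorch is installed under the distribution name `torch` and only checks that version. The CUDA version could be inspected separately via `torch.version.cuda` once PyTorch is installed, and the post gives no mmcv version to compare against.

```python
from importlib import metadata
from typing import Dict, Optional

# Version named in the post; the distribution name "torch" is an assumption.
REQUIRED = {"torch": "2.3.1"}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed: Optional[str], required: str) -> bool:
    """Compare versions, ignoring a local build suffix such as '+cu121'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == required


def check_environment(required: Dict[str, str]) -> Dict[str, bool]:
    """Map each required package to whether its installed version matches."""
    return {
        name: version_matches(installed_version(name), version)
        for name, version in required.items()
    }


if __name__ == "__main__":
    for name, ok in check_environment(REQUIRED).items():
        print(f"{name}: {'ok' if ok else 'missing or different version'}")
```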

## Application Prospects and Significance

- **Academic**: Unifies video referential understanding and localization, encouraging research that spans both directions;
- **Industrial**: The multi-turn dialogue capability can be applied to scenarios such as intelligent surveillance, video content review, and educational assistance;
- **Open-source Ecosystem**: The data, code, and models are all open-sourced, which should accelerate progress in the field.

## Conclusion: An Important Step Towards Practical Video Large Models

SAMA delivers a technical advance and shows the value of pairing academic research with solid engineering. As video makes up a growing share of content, technologies that understand video in depth and interact naturally will play a key role in AI applications, and SAMA gives researchers and developers a strong starting point to build on.
