SAMA: A Multi-turn Dialogue Framework for LLMs to Truly Understand Videos and Precisely Locate Objects

SAMA, a framework open-sourced by a Fudan University team, is described as the first to unify video referential understanding and visual localization in a single multi-turn dialogue task. The work was published at NeurIPS 2025, and the team has released 239,000 training samples together with the complete code.

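Tags: SAMA, video large language model, video referential understanding, video localization, multi-turn dialogue, NeurIPS 2025, Fudan University, Segment Anything, video segmentation, multimodal AI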

Published 2026-05-20 16:44 | Recent activity 2026-05-20 16:48 | Estimated read 5 min

Section 01

Introduction: The SAMA Framework, a Breakthrough in Video Large Language Models

SAMA, published at NeurIPS 2025 by a Fudan University team, is described as the first framework to unify video referential understanding and visual localization in a single multi-turn dialogue task. The team has open-sourced 239,000 training samples and the complete code, offering a new approach to long-standing video understanding challenges.

Section 02

Core Challenges in Video Understanding

Current video large language models face two core challenges: video referential understanding (comprehending the semantics of the specific regions or objects a user refers to) and video localization (precisely segmenting objects from a textual description). Most existing methods handle these two tasks separately, which limits how far such models can develop into multimodal intelligent assistants.
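To make the two tasks concrete, the following is a minimal Python sketch of how both could appear as turns of one dialogue. The `Turn` and `Dialogue` types, their field names, and the sample content are hypothetical and chosen only for illustration; they are not the SAMA data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; not the SAMA data format.
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Turn:
    question: str
    answer: str
    # Referential understanding: the user points at a region in one frame.
    referred_region: Optional[Tuple[int, Box]] = None  # (frame index, box)
    # Localization: the model is expected to segment this described object.
    target_phrase: Optional[str] = None

    @property
    def kind(self) -> str:
        """Classify the turn by which task(s) it exercises."""
        has_region = self.referred_region is not None
        has_target = self.target_phrase is not None
        if has_region and has_target:
            return "both"
        if has_region:
            return "referential"
        if has_target:
            return "localization"
        return "plain"


@dataclass
class Dialogue:
    video_id: str
    turns: List[Turn] = field(default_factory=list)

    def task_kinds(self) -> List[str]:
        return [turn.kind for turn in self.turns]


dialogue = Dialogue(
    video_id="demo_video",
    turns=[
        Turn(
            "What is the animal in this region doing?",
            "It is chasing a ball.",
            referred_region=(0, (40, 60, 200, 220)),
        ),
        Turn(
            "Please segment the ball it is chasing.",
            "Here is the ball.",
            target_phrase="the ball",
        ),
    ],
)
print(dialogue.task_kinds())  # ['referential', 'localization']
```

Keeping both turn kinds in one conversation means the second question can depend on the answer to the first, which is the kind of interaction that separate task pipelines cannot express.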

Section 03

SAMA's Three-in-One Innovative Solution

SAMA addresses the problem on three fronts, namely the dataset, the model architecture, and the evaluation benchmark:

SAMA-239K Dataset: Integrates 239,000 samples from 15,000 videos, supporting joint learning of referential understanding, localization, and multi-turn dialogue;
Model Architecture: Includes a spatiotemporal context aggregator (tracking object trajectories and cross-frame association) and integration with Segment Anything Model (zero-shot segmentation capability), with 1B/4B/8B scale weights open-sourced;
SAMA-Bench Benchmark: 5,067 questions across 522 videos, providing a unified evaluation standard.
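As a rough sense of scale, the figures quoted above can be turned into per-video averages. This is simple arithmetic on the article's numbers (which may be rounded); it says nothing about how samples are actually distributed across videos.

```python
# Figures quoted in the article; the training numbers are likely rounded.
train_samples, train_videos = 239_000, 15_000
bench_questions, bench_videos = 5_067, 522

samples_per_video = train_samples / train_videos
questions_per_video = bench_questions / bench_videos

print(f"SAMA-239K: about {samples_per_video:.1f} samples per video")
print(f"SAMA-Bench: about {questions_per_video:.1f} questions per video")
# SAMA-239K: about 15.9 samples per video
# SAMA-Bench: about 9.7 questions per video
```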

Section 04

Experimental Results: Multiple SOTAs and Strong Generalization Capability

SAMA reports leading results on multiple benchmarks:

Significantly outperforms existing methods on SAMA-Bench;
Achieves new SOTA on general video localization benchmarks (e.g., Ref-DAVIS, Ref-YouTube-VOS);
Maintains competitiveness on standard visual understanding benchmarks and shows robust generalization on unseen video types.
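The article does not state which metrics are reported. Referring video segmentation benchmarks such as Ref-DAVIS are commonly scored with mask overlap measures, so the sketch below shows plain intersection-over-union (IoU) between a predicted and a ground-truth binary mask, averaged over frames. The function names are hypothetical and this is not the SAMA evaluation script.

```python
from typing import List, Sequence

Mask = List[List[int]]  # binary mask: rows of 0/1 values


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return 1.0 if union == 0 else inter / union


def mean_iou_over_frames(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average per-frame IoU across a video clip."""
    if not preds or len(preds) != len(truths):
        raise ValueError("need the same non-zero number of frames in both inputs")
    scores = [mask_iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


pred_frames = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
true_frames = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
print(mean_iou_over_frames(pred_frames, true_frames))  # 0.75
```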

Section 05

Technical Implementation Details

Environment Configuration: Based on PyTorch 2.3.1, CUDA 12.1, and mmcv;
Training Strategy: Distributed training on 8 A100 (80 GB) GPUs, supporting three model scales, with weight conversion scripts provided;
Inference Support: Provides complete evaluation scripts for image and video segmentation tasks, lowering the barrier to reproduction.
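Before attempting a reproduction, it can help to confirm that the local environment matches the versions listed above. The following is a minimal sketch, assuming an exact match on PyTorch 2.3.1 and CUDA 12.1 is wanted; the helper names are hypothetical, and the repository's own setup instructions take precedence.

```python
from typing import Dict, Optional, Tuple

# Versions quoted in the article.
EXPECTED_TORCH = "2.3.1"
EXPECTED_CUDA = "12.1"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.3.1+cu121' into (2, 3, 1), ignoring any local build suffix."""
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def matches(found: Optional[str], expected: str) -> bool:
    """True when the installed version equals the expected one."""
    if found is None:
        return False
    return parse_version(found) == parse_version(expected)


def check_environment() -> Dict[str, bool]:
    """Report whether the installed PyTorch and its CUDA build match."""
    try:
        import torch
    except ImportError:
        return {"torch": False, "cuda": False}
    return {
        "torch": matches(torch.__version__, EXPECTED_TORCH),
        # torch.version.cuda is None on CPU-only builds.
        "cuda": matches(torch.version.cuda, EXPECTED_CUDA),
    }


if __name__ == "__main__":
    print(check_environment())
```

The check is deliberately strict. A nearby version may also work, but the article only names these two.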

Section 06

Application Prospects and Significance

Academic: Unifies the fields of video referential understanding and localization, encouraging research that spans both directions;
Industrial: Multi-turn dialogue capability can be applied to scenarios like intelligent monitoring, video review, and educational assistance;
Open-source Ecosystem: Complete data, code, and models are open-sourced, accelerating the development of the field.

Section 07

Conclusion: An Important Step Towards Practical Video Large Models

SAMA delivers a technical advance and demonstrates the value of combining academic research with engineering. As video makes up a growing share of content, technologies that understand videos in depth and interact naturally will play a key role in AI applications. With its data, code, and models released, SAMA gives researchers and developers a solid starting point for further exploration.
