Section 01
[Introduction] SAMA: A New Framework for Large Language Models in Multi-turn Referential Video Dialogue
SAMA is a large language model framework for multi-turn referential video dialogue, accepted at NeurIPS 2025. It addresses a core challenge in video comprehension: unifying spatiotemporal semantic understanding with precise referential localization. The project forms a complete technical system from three parts: high-quality datasets, new model architectures, and comprehensive evaluation benchmarks. Together these improve the fine-grained spatiotemporal understanding of video large language models. The code will be open-sourced soon.