Zing Forum

A Comprehensive Review of Multimodal Large Language Models in Image and Video Segmentation

An in-depth analysis of the Awesome-MLLM-Segmentation repository, covering over 30 cutting-edge studies from referring expression segmentation to open-vocabulary semantic segmentation, revealing how MLLMs are reshaping pixel-level understanding in computer vision.

Tags: Multimodal Large Language Models · Image Segmentation · Video Segmentation · Referring Expression Segmentation · Open-Vocabulary Semantic Segmentation · Reasoning Segmentation · Computer Vision · MLLM · SAM · LLaVA
Published 2026-04-12 16:05 · Recent activity 2026-04-12 16:18 · Estimated read 7 min

Section 01

[Introduction] Multimodal Large Language Models Reshape the Paradigm of Image and Video Segmentation Technology

Based on the Awesome-MLLM-Segmentation repository, this article summarizes over 30 cutting-edge studies from top conferences/journals between 2023 and 2025, covering core directions such as referring expression segmentation, open-vocabulary semantic segmentation, video segmentation, and reasoning segmentation. It reveals how Multimodal Large Language Models (MLLMs) are reshaping pixel-level understanding of images and videos, and also includes applications in vertical fields like remote sensing and prospects for technical trends.


Section 02

Background: Limitations of Traditional Segmentation and the Transformation by MLLMs

Traditional image segmentation (semantic, instance, and panoptic segmentation) requires task-specific architecture design and training pipelines. The rise of MLLMs such as GPT-4V and LLaVA has extended their powerful reasoning capabilities to the pixel level. Awesome-MLLM-Segmentation systematically collects the key progress in this field, which is redefining the paradigm of segmentation technology.


Section 03

Referring Expression Segmentation: Breakthrough from Text to Precise Masks

Referring Expression Segmentation (RES) requires models to segment specific objects according to text descriptions:

  • LISA (CVPR 2024): The first to introduce reasoning capabilities; uses chain-of-thought to explain its decisions and embeds segmentation masks as visual tokens in the language model's output space;
  • GLaMM (CVPR 2024): Supports multi-object references and complex interactions, with fine-grained pixel-level grounding;
  • PixelLM (CVPR 2024): A pixel attention mechanism improves segmentation accuracy in scenes with blurred boundaries or occlusions.
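LISA's "embedding-as-mask" idea can be illustrated with a minimal sketch: the LLM's hidden state at a special <SEG> token is projected into the mask decoder's embedding space and scored against per-pixel image features to produce a mask. All dimensions, names, and the random features below are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

# Illustrative dimensions only; not LISA's real configuration.
D_LLM, D_MASK, H, W = 8, 4, 16, 16
rng = np.random.default_rng(0)

def segment_from_seg_token(seg_hidden, pixel_feats, proj):
    """Project the LLM hidden state at the <SEG> token into the mask
    decoder's embedding space, then score it against per-pixel image
    features to produce a mask logit map."""
    query = seg_hidden @ proj        # (D_MASK,)
    logits = pixel_feats @ query     # (H*W,)
    return logits.reshape(H, W)

seg_hidden = rng.standard_normal(D_LLM)             # <SEG> token hidden state
pixel_feats = rng.standard_normal((H * W, D_MASK))  # vision features per pixel
proj = rng.standard_normal((D_LLM, D_MASK))         # learned projection (random here)

mask = segment_from_seg_token(seg_hidden, pixel_feats, proj) > 0  # binary mask
print(mask.shape)  # (16, 16)
```

In the real model, the projection and decoder are trained end-to-end so that the <SEG> embedding carries the referent identified by the language reasoning.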

Section 04

Open-Vocabulary Semantic Segmentation: Breaking the Limitation of Predefined Categories

Open-vocabulary semantic segmentation breaks the limitation of predefined categories:

  • GSVA (CVPR 2024): Generalizes the segmentation concept, hierarchically aligning visual features with concept descriptions to achieve zero-shot generalization to new categories;
  • GROUNDHOG (CVPR 2024): Holistic segmentation that understands all regions of the image, including the background;
  • OMG-LLaVA (NeurIPS 2024): A unified architecture handling multiple tasks such as image classification, detection, and segmentation.
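The zero-shot labeling step common to open-vocabulary methods can be sketched as follows: each segmented region's visual embedding is matched against text embeddings of arbitrary category names by cosine similarity. The tiny hand-made embeddings below are purely illustrative; real systems use CLIP-scale encoders.

```python
import numpy as np

def label_regions(region_embs, text_embs, labels):
    """Assign each region the open-vocabulary label whose text embedding
    is most cosine-similar to the region's visual embedding."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return [labels[i] for i in (r @ t.T).argmax(axis=1)]

# Toy 2-D embeddings chosen by hand so the example is deterministic.
region_embs = np.array([[1.0, 0.1], [0.05, 1.0]])  # two segmented regions
text_embs = np.array([[1.0, 0.0], [0.1, 1.0]])     # text side of each label
labels = ["zebra", "fire hydrant"]

print(label_regions(region_embs, text_embs, labels))  # ['zebra', 'fire hydrant']
```

Because the label set is just a list of strings, new categories can be added at inference time without retraining, which is the core of "breaking predefined categories."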

Section 05

Video Segmentation: Leap from Static to Dynamic

Video segmentation needs to handle spatiotemporal dynamics:

  • VISA (ECCV 2024): The first video MLLM segmentation framework; uses multi-turn dialogue to refine results, with a temporal consistency mechanism ensuring inter-frame coherence;
  • VITRON (NeurIPS 2024): A unified pixel-level model supporting full-stack operations such as understanding, segmentation, generation, and editing;
  • Sa2VA (arXiv 2025): Combines SAM 2 and LLaVA, achieving breakthroughs in dense video understanding.
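One simple form of the inter-frame coherence these systems need can be sketched with IoU-based mask linking: each mask in the current frame is associated with the previous-frame mask it overlaps most, so object identities stay stable over time. This is a generic baseline, not any specific paper's mechanism.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def link_masks(prev_masks, cur_masks, thresh=0.5):
    """Greedily link each current-frame mask to the previous-frame mask
    of highest IoU, keeping object identities stable across frames."""
    links = {}
    for j, cm in enumerate(cur_masks):
        scores = [iou(pm, cm) for pm in prev_masks]
        if scores and max(scores) >= thresh:
            links[j] = int(np.argmax(scores))
    return links

# Toy example: one object moves one pixel to the right between frames.
prev = np.zeros((4, 4), dtype=bool); prev[0:3, 0:3] = True
cur = np.zeros((4, 4), dtype=bool);  cur[0:3, 1:4] = True

print(link_masks([prev], [cur]))  # {0: 0}  (IoU = 6/12 = 0.5)
```

Learned approaches replace the IoU score with propagated mask queries or memory attention, but the matching problem they solve is the same.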

Section 06

Reasoning Segmentation: Segmentation That Teaches Models to 'Think'

Reasoning segmentation requires models to understand instructions before segmentation:

  • CoReS (ECCV 2024): Reasoning and segmentation collaborate through a bidirectional feedback mechanism that dynamically adjusts strategies;
  • SegLLM (ICLR 2025): Multi-turn dialogue interaction guides the model toward the target result;
  • Seg-Zero (arXiv 2025): A cognitive reasoning chain guides segmentation, excelling at common-sense reasoning tasks.
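The two-stage structure shared by these methods, first resolve the implicit instruction, then segment the resolved target, can be shown with a deliberately tiny toy pipeline. The knowledge base, function names, and masks below are all hypothetical; a real system replaces both stages with an MLLM and a mask decoder.

```python
# Toy stand-in for the "think" stage: a lookup from implicit cues to targets.
KNOWLEDGE = {"richest in vitamin C": "orange"}

def reason(instruction):
    """Step 1 ('think'): resolve an implicit query to an explicit target class."""
    for cue, target in KNOWLEDGE.items():
        if cue in instruction:
            return target
    return None

def reasoning_segment(instruction, masks_by_class):
    """Step 2 ('segment'): return the mask of the resolved target class."""
    return masks_by_class.get(reason(instruction))

masks = {"orange": [[1, 0], [0, 0]], "plate": [[1, 1], [1, 1]]}
print(reasoning_segment("segment the food richest in vitamin C", masks))
# [[1, 0], [0, 0]]
```

The point of the sketch is the interface: the instruction never names the target directly, so segmentation quality depends on the reasoning stage getting the referent right.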

Section 07

Vertical Applications: Exploration of MLLMs in Remote Sensing

Applications of MLLMs in remote sensing:

  • GeoGround (arXiv 2024): The first large VLM for remote sensing visual grounding, introducing geospatial priors to improve accuracy;
  • RSUniVLM (arXiv 2024): A unified remote sensing VLM whose granularity-guided mixture-of-experts architecture adapts to different resolutions;
  • GeoPix (arXiv 2025): Pixel-level understanding for remote sensing, leading on multiple benchmarks.
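The granularity-guided routing idea can be reduced to a dispatch sketch: the task's granularity selects which expert processes the input. The experts here are hypothetical string stubs; in RSUniVLM they are learned sub-networks and routing is part of the model.

```python
# Hypothetical expert stubs, one per task granularity.
def image_expert(x):  return f"caption({x})"    # image-level understanding
def region_expert(x): return f"detect({x})"     # region-level grounding
def pixel_expert(x):  return f"segment({x})"    # pixel-level segmentation

EXPERTS = {"image": image_expert, "region": region_expert, "pixel": pixel_expert}

def route(granularity, x):
    """Dispatch the input to the expert matching the requested granularity."""
    return EXPERTS[granularity](x)

print(route("pixel", "harbor scene"))  # segment(harbor scene)
```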

Section 08

Technical Trends and Future Prospects

Technical Trends:

  1. Unified Architecture: Single models such as OMG-LLaVA and VITRON handle multiple tasks;
  2. Reasoning Capability: Interpretability is increasingly important (LISA, CoReS);
  3. Deep Multimodal Fusion: Fine-grained fusion strategies replace simple concatenation.

Future Prospects: Expectations include complex scene handling, natural interaction, and interpretable systems; open topics include reducing computational cost, improving real-time performance, and ensuring result reliability.