# A Comprehensive Review of Multimodal Large Language Models in Image and Video Segmentation

> An in-depth analysis of the Awesome-MLLM-Segmentation repository, covering over 30 cutting-edge studies from referring expression segmentation to open-vocabulary semantic segmentation, revealing how MLLMs are reshaping pixel-level understanding in computer vision.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-12T08:05:12.000Z
- Last activity: 2026-04-12T08:18:45.294Z
- Popularity: 163.8
- Keywords: multimodal large language models, image segmentation, video segmentation, referring expression segmentation, open-vocabulary semantic segmentation, reasoning segmentation, computer vision, MLLM, SAM, LLaVA
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-mc-lan-awesome-mllm-segmentation
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-mc-lan-awesome-mllm-segmentation
- Markdown source: floors_fallback

---

## [Introduction] Multimodal Large Language Models Reshape the Paradigm of Image and Video Segmentation Technology

Based on the Awesome-MLLM-Segmentation repository, this article summarizes over 30 studies published at top conferences and journals between 2023 and 2025. It covers four core directions: referring expression segmentation, open-vocabulary semantic segmentation, video segmentation, and reasoning segmentation. It shows how Multimodal Large Language Models (MLLMs) are reshaping pixel-level understanding of images and videos, then looks at applications in vertical fields such as remote sensing and at emerging technical trends.

## Background: Limitations of Traditional Segmentation and the Transformation by MLLMs

Traditional image segmentation (semantic, instance, and panoptic segmentation) requires task-specific architectures and training pipelines. The rise of MLLMs such as GPT-4V and LLaVA has extended language-model reasoning to the pixel level. The Awesome-MLLM-Segmentation repository systematically collects the key progress in this field, which together amounts to a new paradigm for segmentation.

## Referring Expression Segmentation: Breakthrough from Text to Precise Masks

Referring Expression Segmentation (RES) requires models to segment specific objects according to text descriptions:
- **LISA (CVPR 2024)**: The first to bring reasoning capabilities to segmentation. It adds a special `<SEG>` token to the language model's output vocabulary and decodes that token's embedding into a segmentation mask;
- **GLaMM (CVPR 2024)**: Supports multi-object references and complex interactions, with fine-grained pixel-level grounding;
- **PixelLM (CVPR 2024)**: A pixel attention mechanism improves segmentation accuracy in scenes with blurred boundaries or occlusions.
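
The step from text to mask in the models above can be illustrated with a toy sketch. It assumes the language model emits one embedding vector for a segmentation token and that a vision encoder supplies one feature vector per pixel; a pixel is foreground when its feature scores above a threshold against that embedding. The function name, the dot-product scoring, the threshold, and all values are illustrative assumptions. The listed models use learned mask decoders, not this rule.

```python
from typing import List

Vector = List[float]


def mask_from_token_embedding(
    pixel_features: List[List[Vector]],
    seg_embedding: Vector,
    threshold: float = 0.0,
) -> List[List[int]]:
    """Toy stand-in for a mask decoder (hypothetical helper).

    Scores every pixel feature against the segmentation-token embedding
    with a dot product and marks the pixel as foreground (1) when the
    score exceeds the threshold, otherwise background (0).
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feature in row:
            score = sum(f * e for f, e in zip(feature, seg_embedding))
            mask_row.append(1 if score > threshold else 0)
        mask.append(mask_row)
    return mask


# A 2x3 "image" with 2-dimensional features per pixel (made-up values).
features = [
    [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]],
    [[0.8, -0.1], [-0.5, 0.5], [-0.9, 0.0]],
]
# Made-up embedding standing in for the language model's segmentation token.
seg_token = [1.0, 0.0]

toy_mask = mask_from_token_embedding(features, seg_token)
```

The point of the sketch is the interface: the language model only has to produce one vector, and a separate decoder turns that vector into pixels.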

## Open-Vocabulary Semantic Segmentation: Breaking the Limitation of Predefined Categories

Open-vocabulary semantic segmentation lets a model segment categories that are named at inference time, instead of only a label set fixed during training:
- **GSVA (CVPR 2024)**: Generalizes the segmentation setting and hierarchically aligns visual features with concept descriptions, giving zero-shot generalization to new categories;
- **GROUNDHOG (CVPR 2024)**: Performs holistic segmentation, covering every region of the image (including the background);
- **OMG-LLaVA (NeurIPS 2024)**: A unified architecture that handles multiple tasks such as image classification, detection, and segmentation.
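
The idea of dropping a fixed label set can be sketched as follows. This assumes pixel features and category-name embeddings already live in a shared space; each pixel takes the name whose embedding is most similar by cosine similarity. The helper names and all vectors are hypothetical, and none of the models above is claimed to work exactly this way.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_pixels(
    pixel_features: List[List[Vector]],
    text_embeddings: Dict[str, Vector],
) -> List[List[str]]:
    """Assign each pixel the category name with the most similar embedding."""
    labels = []
    for row in pixel_features:
        label_row = []
        for feature in row:
            best = max(
                text_embeddings,
                key=lambda name: cosine_similarity(feature, text_embeddings[name]),
            )
            label_row.append(best)
        labels.append(label_row)
    return labels


# Made-up 2-dimensional embeddings for two category names.
vocabulary = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}
features = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [-0.8, 0.1]],
]
labels_before = label_pixels(features, vocabulary)

# Adding a category at inference time only requires a new text embedding.
vocabulary["sky"] = [-1.0, 0.0]
labels_after = label_pixels(features, vocabulary)
```

In this toy setup the last pixel is forced into `grass` while only two names exist, and switches to `sky` once that name is added, without any retraining step.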

## Video Segmentation: Leap from Static to Dynamic

Video segmentation must handle spatiotemporal dynamics, not just single frames:
- **VISA (ECCV 2024)**: The first MLLM-based video segmentation framework. It refines results through multi-turn dialogue, and a temporal consistency mechanism keeps masks coherent across frames;
- **VITRON (NeurIPS 2024)**: A unified pixel-level model that supports the full stack of operations: understanding, segmentation, generation, and editing;
- **Sa2VA (arXiv 2025)**: Combines SAM2 with LLaVA to achieve breakthroughs in dense video understanding.
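
Inter-frame coherence can be made concrete with a simple diagnostic: the intersection-over-union (IoU) between the masks of consecutive frames. The sketch below is an assumed way to measure coherence after the fact, with hypothetical function names and made-up masks; it is not the temporal consistency mechanism of VISA or any other model listed above.

```python
from typing import List

Mask = List[List[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """IoU of two binary masks of the same shape.

    Two empty masks are treated as perfectly consistent (IoU 1.0).
    """
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if pixel_a and pixel_b:
                intersection += 1
            if pixel_a or pixel_b:
                union += 1
    return intersection / union if union else 1.0


def temporal_consistency(masks: List[Mask]) -> List[float]:
    """IoU between each pair of consecutive frame masks."""
    return [mask_iou(masks[i], masks[i + 1]) for i in range(len(masks) - 1)]


# Three made-up 2x3 frame masks: the object shifts once, then stays put.
frames = [
    [[1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 0]],
]
scores = temporal_consistency(frames)
```

A low score is not always an error, since a fast-moving object legitimately overlaps little with its previous position; the score is most useful for spotting flicker on objects that should be nearly static.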

## Reasoning Segmentation: Segmentation That Teaches Models to 'Think'

Reasoning segmentation requires a model to work out what an implicit or indirect instruction refers to before it segments the target:
- **CoReS (ECCV 2024)**: Couples reasoning and segmentation through a bidirectional feedback mechanism that adjusts the strategy dynamically;
- **SegLLM (ICLR 2025)**: Uses multi-turn dialogue to steer the model toward the target result;
- **Seg-Zero (arXiv 2025)**: A cognitive reasoning chain guides segmentation, and the model excels at common-sense reasoning tasks.

## Vertical Applications: Exploration of MLLMs in Remote Sensing

Representative applications of MLLMs in remote sensing:
- **GeoGround (arXiv 2024)**: The first large VLM for visual grounding in remote sensing; it introduces geospatial priors to improve accuracy;
- **RSUniVLM (arXiv 2024)**: A unified remote sensing VLM whose granularity-guided mixture-of-experts architecture adapts to different resolutions;
- **GeoPix (arXiv 2025)**: Pixel-level understanding for remote sensing, leading on multiple benchmarks.

## Technical Trends and Future Prospects

Technical trends:
1. **Unified Architecture**: Single models such as OMG-LLaVA and VITRON handle multiple tasks;
2. **Reasoning Capability**: Interpretability is becoming more important (LISA, CoReS);
3. **Deep Multimodal Fusion**: Fine-grained fusion strategies replace simple concatenation.
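
The contrast in the third trend can be sketched with scaled dot-product cross-attention, assumed here as one common form of fine-grained fusion. The text above does not say which fusion strategy any particular model uses, and the function names and vectors are illustrative. Where concatenation simply places text and image features side by side, cross-attention lets each text token weight the image patches individually.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
) -> Tuple[List[Vector], List[List[float]]]:
    """Each query (e.g. a text token) attends over all keys (e.g. patches).

    Returns the fused vectors and the attention weights per query.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        fused = [
            sum(w * value[i] for w, value in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(fused)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up features: two text tokens and three image patches.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

fused_tokens, attention = cross_attention(text_tokens, patches, patches)
```

In this toy input the first text token puts most of its weight on the first patch and the second token on the second patch, which is the per-token selectivity that plain concatenation lacks.
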

Future prospects: Progress is expected in complex-scene handling, natural interaction, and interpretable systems. Open problems include reducing computational cost, improving real-time performance, and ensuring the reliability of results.
