Zing Forum

ROSE Framework: A New Paradigm for Image Segmentation Enabling Real-Time Knowledge Retrieval in Multimodal Large Models

Multimodal large language models (MLLMs) cannot recognize emerging entities in image segmentation tasks. To address this, researchers propose the ROSE framework, which uses retrieval-augmented generation (RAG) to inject real-time web knowledge into the model and improves gIoU by 19.2 points on the NEST benchmark.

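Tags: multimodal large models, image segmentation, retrieval-augmented generation, RAG, emerging entity recognition, MLLM, computer vision, real-time knowledge updates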
Published 2026-04-16 01:59 · Recent activity 2026-04-16 11:19 · Estimated read 6 min

Section 01

Introduction: ROSE Framework, a New Paradigm for Image Segmentation Enabling Real-Time Knowledge Retrieval in Multimodal Large Models

Multimodal large language models (MLLMs) cannot recognize emerging entities in image segmentation tasks. To address this, researchers propose the ROSE (Retrieval-Oriented Segmentation Enhancement) framework, which uses retrieval-augmented generation to inject real-time web knowledge. ROSE improves gIoU by 19.2 points on the Novel Emerging Segmentation Task (NEST) benchmark, and it offers a way past the limits of a static knowledge base by letting the model acquire knowledge dynamically.

Section 02

Background and Challenges: The Difficulty of Recognizing Emerging Entities in MLLMs' Image Segmentation

Multimodal large language models have made significant progress in image understanding, but they face a fundamental challenge in image segmentation: recognizing and handling emerging entities. Because their training data is fixed, existing models (e.g., LISA) cannot recognize concepts that appeared after training and cannot obtain up-to-date background information. In real-world use, when a user asks a model to segment "the latest iPhone" or "a newly announced tech product", the model often fails.

Section 03

NEST Task: A New Benchmark for Systematic Research on Emerging Entity Segmentation

To study this problem, researchers propose the Novel Emerging Segmentation Task (NEST), which divides the challenge into two categories:

  1. Novel entities: new concepts that never appeared in the training data.
  2. Emerging entities: concepts the model has some related knowledge of, but which require up-to-date external information to identify.

The team also built an automated data generation pipeline that extracts real scenarios from news and used it to establish the NEST benchmark dataset.

Section 04

Core Architecture of ROSE Framework: Four Key Components for Plug-and-Play Enhancement

The ROSE framework consists of four key components:

  1. Web Retrieval-Augmented Generation Module: Receives multimodal input (image + text) and retrieves web information in real time; it is optimized for vision-language tasks.
  2. Text Prompt Enhancer: Converts retrieved information into background knowledge prompts. For a query such as "the latest foldable phone", it injects details like release date, specifications, and appearance.
  3. Visual Prompt Enhancer: Retrieves relevant images of novel entities to build a library of visual examples, compensating for the limits of the training data.
  4. WebSense Intelligent Scheduling Module: Analyzes the input to decide whether retrieval should be triggered, reducing unnecessary calls by about 40% and balancing performance against efficiency.
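
The flow through these components can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `needs_retrieval`, `enhance_text_prompt`, `build_inputs`, `Retrieved`, and `fake_retrieve` are hypothetical, the gate is a simple keyword heuristic standing in for WebSense, and the retriever is an offline stub in place of a real web search.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Assumption: a few recency cues are enough to illustrate the gating idea.
RECENCY_CUES = ("latest", "newest", "newly", "just announced")


@dataclass
class Retrieved:
    snippets: List[str] = field(default_factory=list)    # text facts from the web
    image_urls: List[str] = field(default_factory=list)  # reference images


def needs_retrieval(query: str, known_entities: Set[str]) -> bool:
    """Toy stand-in for the WebSense gate (keyword heuristic, not the real module).

    known_entities is expected to contain lowercase names.
    """
    q = query.lower()
    has_recency_cue = any(cue in q for cue in RECENCY_CUES)
    mentions_known = any(name in q for name in known_entities)
    return has_recency_cue or not mentions_known


def enhance_text_prompt(query: str, retrieved: Retrieved) -> str:
    """Prepend retrieved facts to the query as background knowledge."""
    if not retrieved.snippets:
        return query
    background = "\n".join(f"- {s}" for s in retrieved.snippets)
    return f"Background knowledge:\n{background}\n\nInstruction: {query}"


def build_inputs(
    query: str,
    known_entities: Set[str],
    retrieve: Callable[[str], Retrieved],
) -> Tuple[str, List[str], bool]:
    """Return (text prompt, visual examples, whether retrieval was triggered)."""
    if not needs_retrieval(query, known_entities):
        return query, [], False  # skip the web call entirely
    retrieved = retrieve(query)
    return enhance_text_prompt(query, retrieved), retrieved.image_urls, True


def fake_retrieve(query: str) -> Retrieved:
    # Hypothetical offline stub standing in for a real web search call.
    return Retrieved(
        snippets=["Placeholder fact about the queried product."],
        image_urls=["https://example.com/ref.jpg"],
    )


prompt, examples, triggered = build_inputs(
    "segment the latest foldable phone", {"dog", "car"}, fake_retrieve
)
```

In this sketch a query such as "segment the dog" bypasses retrieval because it names a known entity and carries no recency cue, which is the kind of saving the scheduling module is described as providing.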

Section 05

Technical Highlights: Deep Integration of RAG and Multimodal Segmentation

The innovation of ROSE lies in tightly integrating retrieval-augmented generation (RAG) with multimodal segmentation. Traditional RAG serves only text generation; ROSE extends it to pixel-level prediction. The framework is plug-and-play: it can enhance any MLLM-based segmentation model without modifying the underlying architecture or retraining, which lowers the barrier to adoption.
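
One way to picture the plug-and-play property is a thin wrapper that changes only what is fed to the model. The sketch below is an assumption-laden illustration, not ROSE's actual interface: `Segmenter`, `RetrievalAugmented`, `RecordingSegmenter`, and `add_background` are hypothetical names, and for brevity only the text prompt is enhanced (visual examples are left out).

```python
from typing import Callable, List, Optional, Protocol

Mask = List[List[int]]  # 2D binary mask, 1 = object pixel


class Segmenter(Protocol):
    """Any model that maps (image, text prompt) to a binary mask."""

    def segment(self, image: object, prompt: str) -> Mask: ...


class RetrievalAugmented:
    """Wraps a segmenter; only the prompt changes, the model is untouched."""

    def __init__(self, base: Segmenter, enhance: Callable[[str], str]) -> None:
        self.base = base
        self.enhance = enhance

    def segment(self, image: object, prompt: str) -> Mask:
        return self.base.segment(image, self.enhance(prompt))


class RecordingSegmenter:
    """Dummy base model that records the prompt it receives."""

    def __init__(self) -> None:
        self.last_prompt: Optional[str] = None

    def segment(self, image: object, prompt: str) -> Mask:
        self.last_prompt = prompt
        return [[0]]  # placeholder mask


def add_background(prompt: str) -> str:
    # Hypothetical enhancer; a real one would insert retrieved facts here.
    return f"Background: (retrieved facts would go here)\n{prompt}"


base = RecordingSegmenter()
wrapped = RetrievalAugmented(base, add_background)
mask = wrapped.segment(None, "segment the latest foldable phone")
```

Because the wrapper exposes the same `segment` method as the model it wraps, it can be swapped in without touching weights or architecture.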

Section 06

Experimental Results: Significant Performance Improvement on NEST Benchmark

On the NEST benchmark, ROSE reports the following results:

  • Compared with a strong retrieval baseline based on Gemini-2.0 Flash, gIoU improved by 19.2 points;
  • The synergy between text and visual prompts improved the segmentation accuracy of emerging entities;
  • The WebSense module reduced unnecessary retrieval calls by about 40%, balancing performance and efficiency.
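
The article does not define gIoU. In LISA-style segmentation evaluation it is the mean of per-image intersection-over-union scores, and the sketch below assumes NEST follows that convention. The function names are illustrative, masks are plain nested lists to keep the example dependency-free, and scoring two empty masks as a perfect match is an assumption.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # 2D binary mask, 1 = object pixel


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """gIoU in percent: the mean of the per-image IoUs."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need equally sized, non-empty lists of masks")
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(ious) / len(ious)


preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = giou(preds, targets)  # per-image IoUs are 0.5 and 1.0
```

Under this definition, a 19.2-point improvement is the difference between two such percentage scores computed over the same set of images.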

Section 07

Application Prospects and Significance: The Shift from Static to Dynamic Knowledge Acquisition

The ROSE framework has broad application prospects:

  • Recognizing newly listed products in e-commerce scenarios;
  • Segmenting the main body of emerging events in news image analysis;
  • Tracking new trends in social media monitoring;
  • Identifying new traffic signs or vehicle types in autonomous driving.

This work marks an important shift in multimodal AI from a "static knowledge base" to "dynamic knowledge acquisition", laying the foundation for visual systems that learn continuously.