Zing Forum

MissRAG: A RAG Framework Addressing the Missing Modality Challenge in Multimodal Large Language Models

Introducing MissRAG, the first Retrieval-Augmented Generation (RAG) framework designed specifically for the missing modality problem in multimodal large language models, supporting retrieval over arbitrary combinations of three modalities: audio, video, and text.

Tags: multimodal · RAG · ICCV 2025 · missing-modality · MLLM · ImageBind · OneLLM · VideoLLaMA
Published 2026-05-11 23:12 · Recent activity 2026-05-11 23:30 · Estimated read 6 min

Section 01

MissRAG Framework Overview: An Innovative Solution to the Missing Modality Challenge in Multimodal Large Language Models

MissRAG is the first Retrieval-Augmented Generation (RAG) framework specifically designed to address the missing modality problem in multimodal large language models, supporting retrieval over arbitrary combinations of three modalities: audio, video, and text. It was developed by the AImageLab team at the University of Modena and Reggio Emilia, Italy, and the related research has been accepted at ICCV 2025.

Section 02

Background: The Practical Dilemma of Multimodal AI Deployment—the Missing Modality Problem

In an ideal laboratory setting, multimodal large language models (MLLMs) receive complete data input, but in practice, sensor failures, hardware limitations, privacy regulations, environmental noise, and data transmission errors often leave modalities missing or corrupted. This 'missing modality problem' is a core challenge for deploying multimodal AI. For example, when an autonomous-driving camera is dazzled by glare, a surveillance microphone is drowned out by rain, or a medical imaging sequence is corrupted, the key question is whether the model can keep working.

Section 03

Core of the MissRAG Framework: Retrieval Augmentation Under Missing Modalities and Three-Modality Support

The core innovation of MissRAG is that, when input modalities are missing, it retrieves the most similar substitute data from a modality prototype pool built from the training set, allowing the model to generate high-quality outputs even with incomplete inputs. The framework accepts arbitrary combinations of the three modalities (audio, video, text): a single modality, two modalities, or all three. Developers do not need to train separate model versions for different input combinations, which makes the framework highly adaptable.
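
A minimal sketch of this retrieval step, assuming ImageBind-style embeddings that live in one shared space (function names, shapes, and the toy data are illustrative, not the official MissRAG API):

```python
import numpy as np

# Sketch: when a modality is absent, find the most similar training sample
# via an available modality's embedding, then substitute that sample's
# stored embedding for the missing modality.

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve_missing_modality(query_emb, pool_available, pool_missing):
    """query_emb: (d,) embedding of an available modality (e.g. text).
    pool_available: (N, d) pool embeddings in the query's modality.
    pool_missing: (N, d) pool embeddings of the absent modality.
    Returns a substitute embedding for the missing modality."""
    q = l2_normalize(query_emb)
    p = l2_normalize(pool_available)
    sims = p @ q                     # cosine similarities, shape (N,)
    best = int(np.argmax(sims))      # nearest training-set prototype
    return pool_missing[best]

# Toy example: 4 pool items with 8-dim aligned embeddings; the query is a
# slightly perturbed copy of pool item 2, so item 2 should be retrieved.
rng = np.random.default_rng(0)
pool_text = rng.normal(size=(4, 8))
pool_audio = rng.normal(size=(4, 8))
query = pool_text[2] + 0.01 * rng.normal(size=8)
substitute_audio = retrieve_missing_modality(query, pool_text, pool_audio)
```

Because the embeddings are aligned across modalities, similarity computed in one modality is a reasonable proxy for relevance in another; that is what makes the substitution meaningful.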

Section 04

Technical Mechanism: Modality Prototype Retrieval Pool and Modality-Aware Prompt Engineering

MissRAG consists of two key components:

  1. Modality Prototype Retrieval Pool: uses ImageBind as the contrastive embedder to extract aligned audio, video, and text embeddings from the training set and build the retrieval pool; the modality tokens of training-set samples are precomputed and stored in .h5 format to speed up inference.
  2. Modality-Aware Prompt Engineering: explicitly states in the prompt which modalities are missing, guiding the generation process and helping the model adjust its reasoning strategy.
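
The two components above can be sketched together; the .h5 layout, dataset keys, and prompt template below are assumptions for illustration, not the project's actual code:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical sketch of both components: (1) precomputed modality tokens
# stored in .h5 for fast lookup, (2) a prompt that names the missing
# (substituted) modalities explicitly.

path = os.path.join(tempfile.mkdtemp(), "modality_tokens.h5")

# 1) Precompute modality tokens once and store them for reuse at inference.
rng = np.random.default_rng(0)
with h5py.File(path, "w") as f:
    f.create_dataset("audio", data=rng.random((100, 32, 768), dtype=np.float32))
    f.create_dataset("video", data=rng.random((100, 32, 768), dtype=np.float32))

# At inference time, read only the retrieved sample's tokens instead of
# re-running the encoder over the training set.
with h5py.File(path, "r") as f:
    retrieved_audio = f["audio"][42]   # tokens of one retrieved prototype

# 2) Modality-aware prompt: tell the model which inputs are genuine and
# which were substituted from the retrieval pool.
def build_prompt(question, present, substituted):
    status = (f"Available modalities: {', '.join(present)}. "
              f"Substituted from retrieval: {', '.join(substituted) or 'none'}.")
    return f"{status}\nQuestion: {question}"

prompt = build_prompt("What instrument is playing?",
                      present=["video", "text"], substituted=["audio"])
```

Naming the substituted modalities in the prompt lets the model discount retrieved evidence when it conflicts with the genuine inputs.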

Section 05

Experimental Validation: Comprehensive Evaluation Across Tasks and Models

MissRAG was evaluated on five multimodal datasets, covering three types of tasks:

  • Audio-Visual Question Answering (MUSIC-AVQA dataset)
  • Audio-Visual Description (VALOR, CharadesEGO datasets)
  • Multimodal Sentiment Analysis (MOSI, MOSEI datasets)

It was also adapted to three public MLLMs: OneLLM (7B), ChatBridge (13B), and VideoLLaMA 2 (7B), demonstrating its generality.

Section 06

Open-Source Resources and Usage Guide

The MissRAG team has open-sourced the modality pool and modality-token data on Hugging Face. The code is clearly structured, with a separate implementation directory and instructions for each supported model. The typical workflow is: clone the repository → create a Python environment for the chosen model → download the datasets and precomputed modality tokens → run the evaluation scripts. The documentation also provides a prototype-construction guide to help users build retrieval pools for their own datasets.
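
A hedged sketch of what a prototype-construction script might look like for a custom dataset, following the guide's outline (the embedder below is a stand-in for a real ImageBind forward pass; sample IDs, dimensions, and the file layout are made up):

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical pool construction: embed every training sample per modality
# with an aligned encoder (ImageBind in MissRAG), then persist to .h5.

def fake_embed(sample_id: str, dim: int = 1024) -> np.ndarray:
    # Placeholder for an encoder forward pass; deterministic per sample ID.
    rng = np.random.default_rng(abs(hash(sample_id)) % (2**32))
    return rng.normal(size=dim).astype(np.float32)

train_samples = ["clip_001", "clip_002", "clip_003"]
modalities = ["audio", "video", "text"]

pool_path = os.path.join(tempfile.mkdtemp(), "prototype_pool.h5")
with h5py.File(pool_path, "w") as f:
    for m in modalities:
        embs = np.stack([fake_embed(f"{m}:{s}") for s in train_samples])
        f.create_dataset(m, data=embs)          # shape: (N, dim)

with h5py.File(pool_path, "r") as f:
    pool_shapes = {m: f[m].shape for m in modalities}
```

With the pool on disk, inference-time retrieval only needs a similarity search over these arrays, never a re-run of the encoders.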

Section 07

Significance and Outlook: A Step from Ideal Scenarios to Real-World Applications

MissRAG marks an important step forward for multimodal RAG technology from ideal scenarios to real-world applications, providing a solution for data incompleteness in the real world. This idea is not only applicable to multimodal scenarios but also provides inspiration for improving the robustness of single-modal RAG systems. In the future, more 'fault-tolerant' AI systems are expected to emerge, making optimal decisions under imperfect inputs.