
Retrieval Dilemma of Multimodal Large Language Models: Why Strong Generative Capabilities Coexist with Weak Retrieval Performance

An ACL 2026 study shows that multimodal large language models (MLLMs) excel at generative tasks but exhibit systematic weaknesses in multimodal retrieval. This article analyzes the root causes of this gap and the directions for improvement.

Tags: Multimodal Large Language Models · Cross-modal Retrieval · Generative AI · ACL 2026 · Contrastive Learning · Model Evaluation · Representation Learning
Published 2026-05-09 18:07 · Recent activity 2026-05-09 18:51 · Estimated read 5 min

Section 01

[Introduction] The Gap Between Generative and Retrieval Capabilities of Multimodal Large Language Models

The ACL 2026 study Generative Giants, Retrieval Weaklings shows that Multimodal Large Language Models (MLLMs) excel at generative tasks such as image caption generation and visual question answering, yet exhibit systematic weaknesses in multimodal retrieval tasks. This article analyzes the root causes of this phenomenon, the experimental evidence behind it, and the directions for improvement, to help clarify the capability boundaries of MLLMs.


Section 02

Research Background: Dual-Track Development of Multimodal AI and Intuitive Contradiction

Multimodal AI has developed along two main tracks: generative tasks (e.g., image captioning and visual question answering, which require producing new content) and retrieval tasks (e.g., cross-modal matching, which requires selecting the most relevant item from a set of candidates). Intuitively, a model with strong generative capabilities should also be good at retrieval, but in practice many top MLLMs perform only modestly on retrieval benchmarks, often lagging behind dedicated retrieval models.


Section 03

Core Findings: The Capability Gap Between Generation and Retrieval and Its Technical Reasons

The study identifies several underlying reasons why MLLMs are strong generators but weak retrievers:

  1. Architecture and Training Objective Differences: Autoregressive generative architectures optimize next-token prediction and never directly optimize cross-modal similarity (see the loss sketch after this list);
  2. Inconsistent Representation Spaces: Generation does not require inputs and outputs to share a semantic space, whereas retrieval requires comparable representations in a shared embedding space;
  3. Training Data Bias: Pre-training data emphasizes descriptive content and offers little supervision for precise matching;
  4. Mismatched Evaluation Metrics: Generative tasks are scored with lenient semantic or n-gram metrics, whereas retrieval is scored with strict precision/recall metrics.
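
To make the objective mismatch in point 1 concrete, the sketch below contrasts the two losses in minimal PyTorch. It is an illustrative sketch, not the paper's implementation; the tensor shapes and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, targets):
    """Generative objective: autoregressive next-token prediction.
    logits: (batch, seq_len, vocab_size) from the decoder; targets: (batch, seq_len) shifted token ids."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Retrieval objective: CLIP-style symmetric InfoNCE over a batch of matched image-text pairs.
    image_emb, text_emb: (batch, dim) pooled embeddings; pair i is the positive for row/column i."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```

Only the second loss directly shapes a shared embedding space in which cross-modal similarity can be compared, which is exactly what retrieval requires.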

Section 04

Experimental Verification: Systemic Gaps in MLLMs' Retrieval Performance

The research team tested mainstream MLLMs on multiple datasets:

  • Zero-shot retrieval performance falls far below that of supervised, dedicated retrieval models (a Recall@K scoring sketch follows this list);
  • Fine-tuning yields only limited improvement, indicating that the weaknesses are rooted in the architecture and pre-training objectives;
  • The error patterns are distinctive: models struggle to distinguish candidates that are semantically similar but not exact matches and are insensitive to subtle differences (unlike the hallucinations or missing details typical of generative tasks).
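
For orientation, zero-shot retrieval benchmarks of this kind are typically scored with Recall@K over a query-candidate similarity matrix. The sketch below shows one common way to compute it; it is an illustration, not the paper's exact evaluation protocol.

```python
import numpy as np

def recall_at_k(similarity, k=5):
    """similarity[i, j]: score between query i and candidate j.
    Assumes the correct candidate for query i sits at index i (standard paired-benchmark layout)."""
    order = np.argsort(-similarity, axis=1)                    # candidates sorted best-first per query
    hits = (order[:, :k] == np.arange(similarity.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Example: with random scores over 100 queries x 100 candidates, Recall@5 is near chance (~0.05)
print(recall_at_k(np.random.rand(100, 100), k=5))
```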

Section 05

Improvement Directions: How to Enhance MLLMs' Retrieval Capabilities?

Possible improvement paths:

  1. Hybrid Architecture: Retain generative capabilities while adding a dedicated retrieval module;
  2. Optimized Pre-training Objectives: Explicitly incorporate contrastive learning, which has already proven effective in pure vision-language pre-training;
  3. Retrieval-oriented Instruction Fine-tuning: Teach the model to compare and rank multimodal content (see the sketch after this list).
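
As a rough illustration of paths 2 and 3, a common recipe is to pool a decoder's hidden states into a fixed-size embedding and fine-tune it with a contrastive loss such as the one sketched in Section 03. The code below assumes a generic Hugging Face-style backbone; the model name and the mean-pooling choice are assumptions, not the paper's method.

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "some-multimodal-llm"   # placeholder; any backbone that exposes hidden states
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(texts):
    """Masked mean-pooling over the last hidden states yields one retrieval embedding per input."""
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)          # (batch, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

# Retrieval-oriented fine-tuning then applies a contrastive loss over embed(queries) and
# embed(positives), so the model learns to compare and rank rather than only to generate.
```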

Section 06

Implications for the Industry: Recommendations for Model Selection and System Design

The research offers the following guidance for industry practitioners:

  1. Evaluate Capability Boundaries: Do not assume that strong generative capabilities imply strong retrieval capabilities; evaluate each model against the target scenario;
  2. Model Combination Strategy: For applications that need both generation and retrieval, use a dedicated retrieval model for initial filtering, then apply the MLLM for in-depth analysis (see the pipeline sketch after this list);
  3. Future Model Design: Balance generative and retrieval capabilities, or expose flexible configuration options.
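
The combination strategy in point 2 amounts to a two-stage pipeline. The sketch below uses hypothetical `retriever` and `mllm` interfaces purely to show the control flow; the method names are assumptions, not an existing API.

```python
def retrieve_then_analyze(query, corpus, retriever, mllm, top_k=10):
    """Two-stage pipeline: a dedicated retrieval model filters candidates cheaply,
    then the MLLM performs in-depth analysis only on the shortlist."""
    # Stage 1: embedding-based filtering with a dedicated retriever (e.g. a CLIP-style model)
    scored = sorted(corpus, key=lambda item: retriever.score(query, item), reverse=True)
    shortlist = scored[:top_k]

    # Stage 2: expensive generative reasoning with the MLLM, restricted to the shortlist
    return [mllm.analyze(query, item) for item in shortlist]
```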