PUMA: A Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval

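Tags: multimodal retrieval, model pruning, self-distillation, contrastive learning, vision-language models, Qwen2-VL, LoRA, machine learning, computer vision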
Published 2026-06-07 02:33 | Recent activity 2026-06-07 02:52 | Estimated read 7 min
1

Section 01

Introduction

PUMA, proposed by the iLearn Lab at Harbin Institute of Technology (Shenzhen), targets the efficiency problems of multimodal large language models (MLLMs) in unified multimodal retrieval. It combines layer-pruned self-distillation with a modality-adaptive contrastive learning loss to cut the parameter count substantially while maintaining retrieval performance.

2

Section 02

Original Authors and Sources

  • Original Author/Maintainer: iLearn Lab, Harbin Institute of Technology (Shenzhen)
  • Authors: Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
  • Source Platform: GitHub
  • Original Title: PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
  • Original Link: https://github.com/iLearn-Lab/ACM-MM25-PUMA
  • Paper Link: https://arxiv.org/abs/2507.08064
  • Conference: ACM MM 2025
  • Release Date: June 6, 2026
3

Section 03

Research Background and Challenges

Unified Multimodal Retrieval (UMR) is an important application of Multimodal Large Language Models (MLLMs). It requires a model to align and retrieve content semantically across modalities such as images and text. However, existing MLLMs face serious efficiency problems in UMR tasks:

  1. Large parameter counts: Mainstream MLLMs typically contain billions of parameters, which makes inference expensive
  2. High computational overhead: A forward pass through the full model requires substantial compute
  3. Difficult deployment: Such models are hard to deploy in resource-constrained settings

The key question for practical UMR is therefore how to cut the model's computational overhead substantially while maintaining retrieval performance.

4

Section 04

Overview of the PUMA Method

The research team from Harbin Institute of Technology (Shenzhen) proposed PUMA, a layer-pruned language model for efficient unified multimodal retrieval. It addresses the efficiency challenges from two perspectives: model structure and model learning.

5

Section 05

1. Layer-Pruned Self-Distillation

From the perspective of model structure, PUMA substantially reduces the parameter count of an MLLM by structurally pruning it and retaining only the shallow layers. The deep layers are not simply discarded: a self-distillation mechanism lets the pruned shallow model learn from the complete model, so performance is maintained while the parameter count drops.
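
The idea of keeping the shallow layers and distilling from the full model can be sketched with toy layers. This is only an illustration under stated assumptions: `ToyLayer`, `prune_layers`, and `distillation_loss` are hypothetical names, the layers are simple affine maps rather than transformer blocks, and mean squared error on the outputs is an assumed objective, since the text above does not specify the distillation loss PUMA uses.

```python
import copy


class ToyLayer:
    """Stand-in for a transformer layer: an element-wise affine map."""

    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def __call__(self, hidden):
        return [self.scale * h + self.shift for h in hidden]


def prune_layers(layers, k):
    """Build the student from independent copies of the first k teacher layers."""
    if not 0 < k <= len(layers):
        raise ValueError("k must be between 1 and len(layers)")
    return [copy.deepcopy(layer) for layer in layers[:k]]


def forward(layers, hidden):
    """Run the hidden state through each layer in order."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs (assumed objective)."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(teacher_out)


teacher = [ToyLayer(1.0 + 0.1 * i, 0.05) for i in range(6)]  # hypothetical 6-layer teacher
student = prune_layers(teacher, k=2)                          # keep the 2 shallowest layers
inputs = [0.5, -1.0, 2.0]
loss = distillation_loss(forward(student, inputs), forward(teacher, inputs))
print(f"student layers: {len(student)}, distillation loss: {loss:.4f}")
```

In a real setup the student's copied layers would be updated to drive this loss down while the teacher stays frozen; the sketch only shows how the two outputs are compared.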

6

Section 06

2. Modality-Adaptive Contrastive Learning Loss (MAC-Loss)

From the perspective of model learning, PUMA proposes the Modality-Adaptive Contrastive Learning Loss (MAC-Loss). This loss function has two components:

  • Adaptive separation of negatives: The negative candidates in a batch are divided into harder-to-learn intra-modality negatives and relatively easier inter-modality negatives
  • Dynamic temperature strategy: A dynamic temperature is applied on top of this split, which yields hard negative sampling at zero extra cost

This design lets the model learn cross-modal alignment more effectively while avoiding the additional computational overhead of traditional hard negative sampling methods.
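
The separation step can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' code: `split_negatives` and its arguments are hypothetical, and the choice of which modality serves as the reference for "intra" versus "inter" is left to the caller.

```python
def split_negatives(reference_modality, candidate_modalities, positive_index):
    """Return (intra, inter) index lists for the in-batch negatives of one query."""
    intra, inter = [], []
    for index, modality in enumerate(candidate_modalities):
        if index == positive_index:
            continue  # the positive candidate is not a negative
        if modality == reference_modality:
            intra.append(index)  # same modality as the reference: treated as harder
        else:
            inter.append(index)  # different modality: treated as easier
    return intra, inter


# Hypothetical batch of five candidates; candidate 0 is the positive.
batch_modalities = ["image", "text", "image", "text", "image"]
intra, inter = split_negatives("image", batch_modalities, positive_index=0)
print("intra-modality negatives:", intra)
print("inter-modality negatives:", inter)
```

Because the split only reads modality labels that are already available in the batch, it needs no extra forward passes or mining step, which is consistent with the zero-cost claim above.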

7

Section 07

Model Architecture

PUMA is built on the Qwen2-VL architecture. It retains the first k layers through layer pruning and then fine-tunes with LoRA (Low-Rank Adaptation). The process consists of:

  1. Layer pruning: Copy and retain the first k layers of the model
  2. Self-distillation training: Use the complete model as the teacher to guide the pruned student model
  3. Two-stage fine-tuning:
    • Stage 1: Initial fine-tuning with the distillation loss
    • Stage 2: Further refinement with MAC-Loss
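
For readers unfamiliar with LoRA, the fine-tuning step relies on adding a trainable low-rank update to a frozen weight matrix. The sketch below shows the standard LoRA computation in plain Python. The matrix sizes, rank, and `alpha` value are hypothetical and are not PUMA's actual settings; `lora_forward` and `lora_param_counts` are illustrative helper names.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, rank):
    """Return (frozen weight parameters, trainable LoRA parameters) for one matrix."""
    return d_in * d_out, rank * (d_in + d_out)


# Tiny example: a 2x3 frozen weight and a rank-1 adapter with B initialised to zero,
# so the adapted layer starts out identical to the frozen one.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.1, 0.2, 0.3]]
B = [[0.0], [0.0]]
x = [1.0, 2.0, 3.0]
y = lora_forward(x, W, A, B, alpha=16, rank=1)

full, trainable = lora_param_counts(d_in=1536, d_out=1536, rank=8)  # hypothetical sizes
print(f"trainable fraction per matrix: {trainable / full:.4%}")
```

The parameter count shows why LoRA pairs well with pruning: the retained layers stay frozen and only the small `A` and `B` matrices are updated.
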
8

Section 08

MAC-Loss Mechanism

The core idea of MAC-Loss is to adjust the difficulty of contrastive learning dynamically according to the modality of each negative sample:

  • Intra-modality negative samples: Negatives from the same modality as the query's target, which are usually harder to distinguish
  • Inter-modality negative samples: Negatives from a different modality than the query's target, which are relatively easier to distinguish

By adaptively weighting these two types of negative samples, MAC-Loss lets the model concentrate on the truly difficult samples rather than spending computation on easily distinguishable ones.
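
One way to picture this weighting is an InfoNCE-style loss in which the two groups of negatives are scaled by different temperatures. The sketch below is an assumption-laden illustration, not the published MAC-Loss formula: the function name, the two fixed temperatures (0.03 and 0.07), and the choice to scale the positive with the intra-modality temperature are all placeholders, and the dynamic temperature schedule is not modelled because the text does not describe it.

```python
import math


def modality_aware_contrastive_loss(sim_pos, sims_intra, sims_inter,
                                    tau_intra=0.03, tau_inter=0.07):
    """InfoNCE-style loss with a separate temperature per negative group.

    sim_pos    -- similarity between the query and its positive candidate
    sims_intra -- similarities to intra-modality (harder) negatives
    sims_inter -- similarities to inter-modality (easier) negatives
    """
    logits = [sim_pos / tau_intra]                   # positive logit first (assumed scaling)
    logits += [s / tau_intra for s in sims_intra]    # sharper temperature: more weight
    logits += [s / tau_inter for s in sims_inter]    # softer temperature: less weight
    peak = max(logits)                               # subtract the max for numerical stability
    log_sum = math.log(sum(math.exp(logit - peak) for logit in logits))
    return peak + log_sum - logits[0]                # equals -log softmax of the positive


# The same negative (similarity 0.6) costs more when it is treated as intra-modality.
loss_as_inter = modality_aware_contrastive_loss(0.8, sims_intra=[], sims_inter=[0.6])
loss_as_intra = modality_aware_contrastive_loss(0.8, sims_intra=[0.6], sims_inter=[])
print(f"as inter: {loss_as_inter:.6f}, as intra: {loss_as_intra:.6f}")
```

With a smaller temperature on the intra-modality group, a negative with positive similarity contributes a larger term to the denominator, so the gradient is dominated by the harder group without any explicit mining step.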