# PUMA: A Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval

> PUMA, proposed by Harbin Institute of Technology (Shenzhen), targets the efficiency problems of multimodal large language models (MLLMs) in unified multimodal retrieval. It combines layer-pruned self-distillation with a modality-adaptive contrastive learning loss, substantially reducing the parameter count while maintaining retrieval performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-06T18:33:35.000Z
- Last activity: 2026-06-06T18:52:59.736Z
- Popularity: 161.7
- Keywords: multimodal retrieval, model pruning, self-distillation, contrastive learning, vision-language models, Qwen2-VL, LoRA, machine learning, computer vision
- Page link: https://www.zingnex.cn/en/forum/thread/puma-f25ccdc7
- Canonical: https://www.zingnex.cn/forum/thread/puma-f25ccdc7

---

## Original Authors and Sources

- **Original Author/Maintainer:** iLearn Lab, Harbin Institute of Technology (Shenzhen)
- **Authors:** Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
- **Source Platform:** GitHub
- **Original Title:** PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
- **Original Link:** https://github.com/iLearn-Lab/ACM-MM25-PUMA
- **Paper Link:** https://arxiv.org/abs/2507.08064
- **Conference:** ACM MM 2025
- **Release Date:** June 6, 2026

## Research Background and Challenges

Unified Multimodal Retrieval (UMR) is an important application of Multimodal Large Language Models (MLLMs). It requires a model to align and retrieve content semantically across modalities such as images and text. Existing MLLMs, however, face serious efficiency problems on UMR tasks:

1. **Large parameter counts**: Mainstream MLLMs usually contain billions of parameters, which makes inference expensive
2. **High computational overhead**: A forward pass through the full model consumes substantial compute
3. **Difficult deployment**: The models are hard to deploy in resource-constrained, real-world settings

Reducing computational overhead substantially while maintaining retrieval performance is therefore a key problem for practical UMR.

## Overview of the PUMA Method

The research team from Harbin Institute of Technology (Shenzhen) proposed PUMA (Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval), which addresses these efficiency challenges from two angles: model structure and model learning.

## 1. Layer-Pruned Self-Distillation

On the structural side, PUMA prunes the MLLM and keeps only its shallow layers, which substantially reduces the parameter count. The deep layers are not simply discarded: a self-distillation mechanism lets the pruned shallow model learn from the complete model, so performance is maintained while the parameters are reduced.
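The idea can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: `ToyLayer` stands in for a transformer block, `prune_first_k` and `distillation_loss` are hypothetical helper names, and mean squared error is only one possible distillation objective (the post does not state which loss PUMA uses).

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyLayer:
    """Toy stand-in for a transformer block: an element-wise affine map."""
    scale: float
    shift: float

    def __call__(self, x):
        return [self.scale * v + self.shift for v in x]


def forward(layers, x):
    """Run the input through every layer in order."""
    for layer in layers:
        x = layer(x)
    return x


def prune_first_k(layers, k):
    """Return an independent copy of the first k layers (the student)."""
    if not 0 < k <= len(layers):
        raise ValueError("k must be between 1 and the number of layers")
    return copy.deepcopy(layers[:k])


def distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs (assumed form)."""
    squared = [(s - t) ** 2 for s, t in zip(student_out, teacher_out)]
    return sum(squared) / len(squared)


# The full model acts as the teacher; the student keeps only the shallow half.
teacher = [ToyLayer(scale=1.0, shift=0.5) for _ in range(4)]
student = prune_first_k(teacher, k=2)

x = [1.0, 2.0]
loss = distillation_loss(forward(student, x), forward(teacher, x))
print(len(student), loss)  # 2 1.0
```

In a real setup, this loss would be minimized with respect to the student's parameters while the teacher stays frozen, so the shallow layers learn to reproduce what the full stack computes.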

## 2. Modality-Adaptive Contrastive Learning Loss (MAC-Loss)

On the learning side, PUMA proposes the Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which has two components:

- **Adaptive separation of negative samples**: The negative candidates in a batch are divided into intra-modality negatives, which are harder to learn, and inter-modality negatives, which are relatively easier
- **Dynamic temperature strategy**: A dynamic temperature is applied on top of this split, giving the effect of hard negative sampling at zero extra cost

This design lets the model learn cross-modal alignment more effectively while avoiding the additional computational overhead of traditional hard negative sampling methods.
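The separation step can be sketched as follows. This is a hedged illustration, not PUMA's code: `split_negatives` is a hypothetical helper, and modalities are represented as plain strings for simplicity. It follows the definition used in this post, where a negative counts as intra-modality when it shares the query's modality.

```python
def split_negatives(query_modality, candidate_modalities, positive_index):
    """Partition in-batch negatives into intra- and inter-modality index lists.

    Only modality labels already present in the batch are used, so no extra
    mining pass over the dataset is needed.
    """
    intra, inter = [], []
    for i, modality in enumerate(candidate_modalities):
        if i == positive_index:
            continue  # the positive candidate is not a negative
        if modality == query_modality:
            intra.append(i)  # same modality as the query: harder negative
        else:
            inter.append(i)  # different modality: easier negative
    return intra, inter


candidate_modalities = ["image", "text", "image", "text", "image"]
intra, inter = split_negatives("image", candidate_modalities, positive_index=0)
print(intra, inter)  # [2, 4] [1, 3]
```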

## Model Architecture

PUMA is built on the Qwen2-VL architecture. It keeps the first k layers through layer pruning and then fine-tunes them with LoRA (Low-Rank Adaptation). The process consists of:

1. **Layer pruning**: Copy and retain the first k layers of the model
2. **Self-distillation training**: Use the complete model as the teacher to guide the pruned student model
3. **Two-stage fine-tuning**:
   - Stage 1: Initial fine-tuning with the distillation loss
   - Stage 2: Further fine-tuning with MAC-Loss
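To make the LoRA step concrete, here is a small sketch of the standard low-rank update, `W + (alpha / r) * B @ A`, written in plain Python. The helper names, matrix sizes, and the values of `alpha` and `r` are illustrative assumptions and are not taken from the PUMA repository.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    r = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, r):
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_in + d_out)


# Frozen weight W is 2 x 3; the adapter has rank 1.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[0.5, 0.5, 0.5]]   # r x d_in
lora_b = [[0.0], [0.0]]      # d_out x r, zero-initialised

# With B at zero, the adapted layer starts out identical to the frozen one.
effective = lora_effective_weight(w, lora_b, lora_a, alpha=2.0)

# After training moves B away from zero, only the low-rank term changes.
updated = lora_effective_weight(w, [[1.0], [0.0]], lora_a, alpha=2.0)

print(trainable_params(1024, 1024, 8), 1024 * 1024)  # 16384 1048576
```

Because only `A` and `B` are trained, the number of trainable parameters per adapted matrix scales with the rank rather than with the full matrix size, which is why LoRA pairs well with an already pruned model.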

## MAC-Loss Mechanism

The core idea of MAC-Loss is to adjust the difficulty of contrastive learning dynamically according to the modality each sample comes from:

- **Intra-modality negative samples**: Negatives from the same modality as the query, which are usually harder to distinguish
- **Inter-modality negative samples**: Negatives from a different modality than the query, which are relatively easier to distinguish

By adaptively weighting these two types of negatives, MAC-Loss directs the model's attention to the genuinely difficult samples instead of spending learning effort on those that are already easy to distinguish.
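One way to picture this weighting is an InfoNCE-style loss in which each negative's similarity is divided by a temperature that depends on its group. The sketch below is an assumption-laden illustration, not the formula from the paper: `mac_style_loss` is a hypothetical name, the temperatures are fixed constants standing in for the dynamic strategy (whose schedule is not described in this post), and the similarity values are made up.

```python
import math


def mac_style_loss(sim_pos, sim_negs, is_intra,
                   tau=0.07, tau_intra=0.05, tau_inter=0.07):
    """InfoNCE-style loss with a separate temperature per negative group.

    sim_pos:  similarity between the query and its positive candidate
    sim_negs: similarities between the query and each negative
    is_intra: True where the negative shares the query's modality
    """
    if len(sim_negs) != len(is_intra):
        raise ValueError("sim_negs and is_intra must have the same length")

    logits = [sim_pos / tau]
    for sim, intra in zip(sim_negs, is_intra):
        logits.append(sim / (tau_intra if intra else tau_inter))

    # Numerically stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


sim_negs = [0.6, 0.2]
is_intra = [True, False]

# Equal temperatures reduce to ordinary InfoNCE.
uniform = mac_style_loss(0.8, sim_negs, is_intra, tau_intra=0.07)

# A lower temperature on the intra-modality negative amplifies its logit,
# so it contributes more to the loss and to the gradient.
adaptive = mac_style_loss(0.8, sim_negs, is_intra, tau_intra=0.05)

print(adaptive > uniform)  # True
```

The design choice in this sketch is that hard negatives are emphasized purely through the temperature, using similarities the batch already provides, so no separate hard negative mining pass is required.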
