
PUMA: A Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval

The PUMA method proposed by Harbin Institute of Technology (Shenzhen) addresses the efficiency challenges of multimodal large language models (MLLMs) in unified multimodal retrieval tasks through layer-pruned self-distillation and modality-adaptive contrastive learning loss, significantly reducing the number of parameters while maintaining retrieval performance.

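Tags: multimodal retrieval, model pruning, self-distillation, contrastive learning, vision-language models, Qwen2-VL, LoRA, machine learning, computer vision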

Published 2026-06-07 02:33 | Recent activity 2026-06-07 02:52 | Estimated read 7 min

Section 01

Introduction: PUMA: A Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval

Section 02

Original Authors and Sources

Original Author/Maintainer: iLearn Lab, Harbin Institute of Technology (Shenzhen)
Authors: Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie
Source Platform: GitHub
Original Title: PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning
Original Link: https://github.com/iLearn-Lab/ACM-MM25-PUMA
Paper Link: https://arxiv.org/abs/2507.08064
Conference: ACM MM 2025
Release Date: June 6, 2026

Section 03

Research Background and Challenges

Unified Multimodal Retrieval (UMR) is one of the important application scenarios for Multimodal Large Language Models (MLLMs). It requires models to perform semantic alignment and retrieval across multiple modalities such as images and text. However, existing MLLMs face severe efficiency challenges in UMR tasks:

Huge number of parameters: Mainstream MLLMs usually contain billions of parameters, leading to high inference costs
High computational overhead: Full model forward propagation requires a lot of computing resources
Difficult deployment: Hard to deploy in resource-constrained practical application scenarios

Significantly reducing the model's computational overhead while maintaining retrieval performance has therefore become a key issue for the practical application of UMR.

Section 04

Overview of the PUMA Method

The research team from Harbin Institute of Technology (Shenzhen) proposed PUMA (Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval) to address efficiency challenges from two perspectives: model structure and model learning.

Section 05

1. Layer-Pruned Self-Distillation

From the perspective of model structure, PUMA significantly reduces the number of parameters of MLLMs by structurally pruning the model and retaining only shallow layers. This method does not simply discard deep layers; instead, it uses a self-distillation mechanism to allow the pruned shallow model to learn the knowledge of the complete model, thus maintaining performance while reducing parameters.
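As a minimal sketch of what a self-distillation objective can look like: the full model (teacher) and the pruned shallow model (student) embed the same input, and the student is penalized when its embedding points away from the teacher's. The cosine-distance form of the loss and the names `l2_normalize`, `cosine_distill_loss`, `student_emb`, and `teacher_emb` are illustrative assumptions; the text above does not specify the exact distillation loss PUMA uses.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_distill_loss(student_emb, teacher_emb):
    """Hypothetical feature-distillation loss: 1 - cos(student, teacher).

    The value is 0 when both embeddings point the same way and grows
    toward 2 as they point in opposite directions.
    """
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return 1.0 - sum(a * b for a, b in zip(s, t))

# Toy embeddings (made-up numbers, not model outputs).
teacher = [0.6, 0.8, 0.0]
aligned_student = [3.0, 4.0, 0.0]      # same direction, different scale
misaligned_student = [0.0, 0.0, 1.0]   # orthogonal to the teacher

print(cosine_distill_loss(aligned_student, teacher))
print(cosine_distill_loss(misaligned_student, teacher))
```

In a real training loop the teacher embedding would be computed without gradient tracking, so only the student is updated.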

Section 06

2. Modality-Adaptive Contrastive Learning Loss (MAC-Loss)

From the perspective of model learning, PUMA proposes the Modality-Adaptive Contrastive Learning Loss (MAC-Loss). This loss function can:

Adaptive negative separation: Splits the negative candidates in a batch into intra-modality negatives, which are harder to learn, and inter-modality negatives, which are relatively easier
Dynamic temperature strategy: Adjusts the contrastive temperature dynamically, achieving hard negative sampling at no extra cost

This design allows the model to learn cross-modal alignment more effectively while avoiding the additional computational overhead of traditional hard negative sampling methods.
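To make the idea concrete, one plausible shape for such a loss is an InfoNCE objective in which each negative is scaled by a temperature chosen from its modality. The formula below is an illustrative assumption, not the paper's definition: $q$ is the query embedding, $c^{+}$ the positive candidate, $\mathcal{N}_{\text{intra}}$ and $\mathcal{N}_{\text{inter}}$ the in-batch negatives from the same and from a different modality, $s(\cdot,\cdot)$ a similarity such as cosine, and $\tau$, $\tau_{\text{intra}}$, $\tau_{\text{inter}}$ hypothetical temperatures with $\tau_{\text{intra}} < \tau_{\text{inter}}$.

```latex
\mathcal{L}(q) = -\log
\frac{\exp\!\big(s(q, c^{+})/\tau\big)}
     {\exp\!\big(s(q, c^{+})/\tau\big)
      + \sum_{c \in \mathcal{N}_{\text{intra}}} \exp\!\big(s(q, c)/\tau_{\text{intra}}\big)
      + \sum_{c \in \mathcal{N}_{\text{inter}}} \exp\!\big(s(q, c)/\tau_{\text{inter}}\big)}
```

Under this assumption, a smaller $\tau_{\text{intra}}$ enlarges the contribution of intra-modality negatives with positive similarity, so they behave like hard negatives without a separate mining pass. A dynamic strategy would vary these temperatures during training rather than fix them.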

Section 07

Model Architecture

PUMA is based on the Qwen2-VL architecture. It retains the first k layers through layer pruning and then uses LoRA (Low-Rank Adaptation) for fine-tuning. The specific process includes:

Layer pruning: Copy and retain the first k layers of the model
Self-distillation training: Use the complete model as the teacher model to guide the learning of the pruned student model
Two-stage fine-tuning:
- Stage 1: Perform initial fine-tuning using distillation loss
- Stage 2: Continue fine-tuning with MAC-Loss
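The layer pruning step above can be sketched in a few lines. This is a toy illustration under stated assumptions: `ToyDecoder` and `prune_to_first_k` are hypothetical names, each layer is a plain dictionary rather than a real transformer block, and the sizes (12 layers, k = 4) are arbitrary. It shows only the "copy and retain the first k layers" idea, not the repository's actual code.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class ToyDecoder:
    """Stand-in for a transformer decoder: just an ordered list of layers."""
    layers: list = field(default_factory=list)

def prune_to_first_k(model, k):
    """Return a new model holding deep copies of the first k layers.

    The original (teacher) model is left untouched so it can still
    produce distillation targets for the pruned student.
    """
    if not 0 < k <= len(model.layers):
        raise ValueError(f"k must be in 1..{len(model.layers)}, got {k}")
    return ToyDecoder(layers=copy.deepcopy(model.layers[:k]))

# Arbitrary toy sizes: a 12-layer teacher pruned to a 4-layer student.
teacher = ToyDecoder(
    layers=[{"name": f"layer_{i}", "weights": [0.0] * 4} for i in range(12)]
)
student = prune_to_first_k(teacher, k=4)

# Updating the student must not change the teacher's weights.
student.layers[0]["weights"][0] = 1.0
print(len(teacher.layers), len(student.layers))
print(teacher.layers[0]["weights"][0])
```

Deep-copying matters here: if the student shared layer objects with the teacher, fine-tuning the student would silently alter the teacher that supplies the distillation signal.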

Section 08

MAC-Loss Mechanism

The core idea of MAC-Loss is to dynamically adjust the difficulty of contrastive learning based on the modality source of the samples:

Intra-modality negative samples: Negative samples from the same modality as the query sample, which are usually harder to distinguish
Inter-modality negative samples: Negative samples from different modalities than the query sample, which are relatively easier to distinguish

By adaptively adjusting the weights of these two types of negative samples, MAC-Loss allows the model to focus more on truly difficult samples while avoiding wasting computing resources on easily distinguishable samples.
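The effect on the two kinds of negatives can be illustrated with a small numerical sketch. It assumes an InfoNCE-style loss in which the temperature applied to a negative depends on whether it shares the query's modality; the function name `modality_aware_info_nce`, its parameters, and the temperature values are hypothetical choices for illustration and are not taken from the PUMA paper or repository.

```python
import math

def modality_aware_info_nce(pos_sim, negatives, query_modality,
                            tau_pos=0.05, tau_intra=0.04, tau_inter=0.07):
    """Hypothetical InfoNCE variant with a per-modality temperature.

    negatives: list of (similarity, modality) pairs for in-batch negatives.
    Returns (loss, shares), where shares[i] is negative i's fraction of the
    softmax denominator, i.e. how strongly it competes with the positive.
    """
    pos_logit = pos_sim / tau_pos
    neg_logits = [
        sim / (tau_intra if modality == query_modality else tau_inter)
        for sim, modality in negatives
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max([pos_logit] + neg_logits)
    pos_term = math.exp(pos_logit - max_logit)
    neg_terms = [math.exp(logit - max_logit) for logit in neg_logits]
    denom = pos_term + sum(neg_terms)
    loss = -math.log(pos_term / denom)
    shares = [term / denom for term in neg_terms]
    return loss, shares

# A text query with two equally similar negatives from different modalities.
negatives = [(0.5, "text"), (0.5, "image")]
loss, shares = modality_aware_info_nce(0.8, negatives, query_modality="text")
print(f"loss={loss:.4f}")
print(f"intra share={shares[0]:.6f}, inter share={shares[1]:.6f}")
```

Although both negatives have the same similarity to the query, the intra-modality one takes a larger share of the denominator because its temperature is lower, so it contributes more to the loss. With equal temperatures the two shares would be identical.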
