Zing Forum


MiT: A New Efficient Fine-Tuning Method for Multimodal Large Models Without Adding Visual Tokens

MiT proposes a multimodal information fusion method that injects visual features directly into the internal computation layers of an LLM instead of adding visual tokens, the traditional approach. It performs referring image segmentation while training only 2.5% of the parameters.

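Multimodal learning, large language models, parameter-efficient fine-tuning, CLIP, LLaMA, referring image segmentation, vision-language models, attention mechanism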
Published 2026-06-09 13:40 | Recent activity 2026-06-09 13:50 | Estimated read 7 min

Section 01

MiT: Guide to the New Efficient Fine-Tuning Method for Multimodal Models Without Adding Visual Tokens

Title: MiT: A New Efficient Fine-Tuning Method for Multimodal Large Models Without Adding Visual Tokens

Core Idea: MiT proposes a multimodal information fusion method that injects visual features directly into the internal computation layers of an LLM, replacing the traditional approach of adding visual tokens. It performs referring image segmentation while training only 2.5% of the parameters.

Advantages: Avoids sequence length expansion, so visual input adds no extra quadratic self-attention cost; keeps the LLM and visual encoder frozen; parameter-efficient.

Source: GitHub project (author kiva12138, published on 2026-06-09, link: https://github.com/kiva12138/MiT)


Section 02

Efficiency Dilemma of Multimodal Large Models

As LLM capabilities improve, multimodal extension has become a hot topic. Traditional methods concatenate visual encoder outputs to the text sequence as additional tokens, which raises three efficiency issues: self-attention cost grows quadratically with sequence length, so every added visual token makes it more expensive; high-resolution images or multi-frame videos sharply increase computation and memory costs; and full fine-tuning of large-scale models requires resources that most researchers cannot afford. The key problem is therefore how to inject multimodal information efficiently while keeping the LLM frozen.
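
To make the quadratic growth concrete, the sketch below counts query-key score entries in one full self-attention map, first for a text-only sequence and then for the same sequence with visual tokens appended. The token counts (32 text tokens, 576 visual tokens) are illustrative assumptions, not figures from the MiT project.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score entries in one full self-attention map."""
    return seq_len * seq_len


text_tokens = 32      # assumed prompt length
visual_tokens = 576   # assumed patch-token count, e.g. a 24 x 24 patch grid

text_only = attention_pairs(text_tokens)
concatenated = attention_pairs(text_tokens + visual_tokens)

print(text_only)                 # 1024
print(concatenated)              # 369664
print(concatenated / text_only)  # 361.0
```

Under these assumptions the sequence is 19 times longer but the attention map is 361 times larger. Causal masking roughly halves both counts and leaves the ratio unchanged.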


Section 03

Core Idea of MiT: Information Infusion Instead of Token Concatenation

The core idea of MiT (Multimodal Infusion Tuning) is to directly inject visual features into the internal computation layers of LLMs instead of converting them into tokens for concatenation. Its advantages include:

  1. Avoids sequence length expansion, so visual input adds no extra quadratic self-attention cost;
  2. The base LLM (e.g., LLaMA) and visual encoder (e.g., CLIP) are fully frozen; only lightweight infusion modules are trained;
  3. Parameter-efficient: only about 2.5% of the parameters need to be trained.

The method has been validated on referring image segmentation, which segments the image region that a text description refers to.
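
As a rough sense of scale, the sketch below works out what trainable-parameter budget a 2.5% share would imply. It assumes the share is measured against all parameters (frozen plus trainable) and uses approximate public sizes for a 7B-class LLaMA model and a ViT-Large image encoder; neither assumption is confirmed by the source, and `trainable_budget` is a hypothetical helper.

```python
def trainable_budget(frozen_params: float, trainable_fraction: float) -> float:
    """Trainable count t such that t / (frozen_params + t) == trainable_fraction."""
    return frozen_params * trainable_fraction / (1.0 - trainable_fraction)


# Approximate sizes, used here as assumptions.
llm_params = 6.7e9      # a 7B-class LLaMA model
vision_params = 0.3e9   # a ViT-Large image encoder
frozen = llm_params + vision_params

budget = trainable_budget(frozen, 0.025)
print(f"{budget / 1e6:.0f}M trainable parameters")
```

If the 2.5% figure is instead measured against the LLM alone, the implied budget is slightly smaller.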

Section 04

Technical Details: Three-Layer Infusion Mechanism

MiT designs a three-layer infusion mechanism that linearly injects CLIP's global image features into selected layers of LLaMA:

  1. Key-Value (K/V) Infusion: Maps image features into the text space via multiplicative and additive transformations, then fuses them element-wise with the text Keys and Values to softly modulate the text representations;
  2. Adaptive Head-Level Rescaling: Introduces learnable head-level vectors, combines them with the cosine similarity between the text Values and the image features, and uses sigmoid gating to adaptively adjust how much visual information is infused;
  3. Feed-Forward Network (FFN) Infusion: Modulates hidden states via a gating mechanism to influence the model's nonlinear transformation.
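
The three mechanisms can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the project's code: all function names, shapes, and weight matrices are hypothetical, the per-head gate uses one learnable scalar per head as a simplification of the head-level vectors, keys and values share one scale and shift for brevity, and the exact formulas in the repository may differ.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def project(image_feat, weight, n_heads, head_dim):
    """Linearly map a global image feature to shape (n_heads, 1, head_dim)."""
    return (image_feat @ weight).reshape(n_heads, 1, head_dim)


def infuse_kv(text_kv, scale, shift):
    """Element-wise multiplicative and additive modulation of text keys or values."""
    return text_kv * (1.0 + scale) + shift


def head_gates(text_values, image_proj, head_param):
    """One sigmoid gate per head: mean value/image cosine similarity plus a learnable offset."""
    dot = (text_values * image_proj).sum(axis=-1)            # (n_heads, seq_len)
    norms = (np.linalg.norm(text_values, axis=-1)
             * np.linalg.norm(image_proj, axis=-1))          # (n_heads, seq_len)
    cosine = dot / (norms + 1e-8)
    return sigmoid(cosine.mean(axis=-1) + head_param)        # (n_heads,)


def infuse_ffn(hidden, image_feat, weight):
    """Gate hidden states channel-wise; zero weights give a gate of exactly 1."""
    gate = 2.0 * sigmoid(image_feat @ weight)                # (d_model,)
    return hidden * gate


rng = np.random.default_rng(0)
n_heads, seq_len, head_dim, img_dim = 4, 5, 8, 16            # illustrative sizes
d_model = n_heads * head_dim

image_feat = rng.standard_normal(img_dim)                    # global image feature
keys = rng.standard_normal((n_heads, seq_len, head_dim))
values = rng.standard_normal((n_heads, seq_len, head_dim))
hidden = rng.standard_normal((seq_len, d_model))

# Hypothetical trainable parameters of the infusion module.
w_scale = 0.02 * rng.standard_normal((img_dim, d_model))
w_shift = 0.02 * rng.standard_normal((img_dim, d_model))
w_ffn = 0.02 * rng.standard_normal((img_dim, d_model))
head_param = np.zeros(n_heads)

scale = project(image_feat, w_scale, n_heads, head_dim)
shift = project(image_feat, w_shift, n_heads, head_dim)

keys_infused = infuse_kv(keys, scale, shift)
values_infused = infuse_kv(values, scale, shift)

# Blend original and infused values with the per-head gate.
gates = head_gates(values, shift, head_param)
values_out = values + gates[:, None, None] * (values_infused - values)
hidden_out = infuse_ffn(hidden, image_feat, w_ffn)

print(keys_infused.shape, values_out.shape, hidden_out.shape, gates.shape)
```

In this sketch the sequence length never changes: the image only rescales and shifts tensors that the text already produced, which is what distinguishes infusion from token concatenation.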

Section 05

Architecture Design and Implementation Details

Architecture Design:

  • Frozen base models: LLaMA-2-7B and CLIP-ViT-Large are fully frozen to retain pre-trained knowledge;
  • Lightweight modules: Only includes a few linear transformations and head-level parameters;
  • Last token pooling: Takes the hidden state of the LLM's last token as the infused text representation;
  • Lightweight segmentation decoder: Combines multi-level CLIP feature maps to generate segmentation masks.
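
Last token pooling can be illustrated with a short NumPy sketch. It assumes right-padded batches and an attention mask with 1 for real tokens; `last_token_pool` is a hypothetical helper, and the project may handle padding differently.

```python
import numpy as np


def last_token_pool(hidden_states, attention_mask):
    """Pick each sequence's final non-padding hidden state.

    hidden_states: (batch, seq_len, dim)
    attention_mask: (batch, seq_len), 1 for real tokens, right-padded with 0
    """
    last_index = attention_mask.sum(axis=1) - 1        # position of the last real token
    batch_index = np.arange(hidden_states.shape[0])
    return hidden_states[batch_index, last_index]      # (batch, dim)


hidden_states = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])

pooled = last_token_pool(hidden_states, attention_mask)
print(pooled)
```

With a causal LLM, the last real token has attended to the whole expression, which is why its hidden state can stand in for the full text.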

Implementation Details: The code is modular and includes Model.py (core model), DecoderTF.py (default segmentation decoder), ReferDataset.py (dataset loading), etc. It is optimized for transformers 4.35.x and rewrites the LLaMA attention logic to support the infusion mechanism.


Section 06

Experimental Validation and Dataset Support

MiT has been validated on multiple referring image segmentation datasets:

  • RefCOCO (19994 images, 142210 referring expressions);
  • RefCOCO+ (19992 images, 141564 referring expressions);
  • RefCOCOg (25799 images, 95010 referring expressions);
  • RefCLEF (based on the SAIAPR TC-12 image set).

The project provides one-click download scripts and data validation tools to lower the barrier to reproduction.
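
The listed counts also give the average number of referring expressions per image, which a few lines of Python can derive. The dictionary below only restates the figures from the list; RefCLEF is omitted because no counts are given for it.

```python
# (images, referring expressions) as listed for each dataset
datasets = {
    "RefCOCO": (19994, 142210),
    "RefCOCO+": (19992, 141564),
    "RefCOCOg": (25799, 95010),
}

per_image = {name: exprs / images for name, (images, exprs) in datasets.items()}

for name, ratio in per_image.items():
    print(f"{name}: {ratio:.2f} expressions per image")
```

RefCOCO and RefCOCO+ average about seven expressions per image, while RefCOCOg averages fewer than four.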

Section 07

Technical Insights and Future Outlook

Technical Insights:

  1. Internal infusion can be more efficient and flexible than external token concatenation;
  2. Freezing the base models is feasible: new capabilities can be added through lightweight adapter-style modules;
  3. Different tasks require different infusion strategies, and the framework is designed to be extensible.

Future Outlook: Extend the approach to more modalities such as audio and video; apply it to tasks such as visual question answering and image captioning; and optimize the structure of the infusion modules to reduce parameters and improve interpretability.