
MiT: A New Efficient Fine-Tuning Method for Multimodal Large Models Without Adding Visual Tokens

MiT proposes a multimodal fusion method that injects visual features directly into the internal computation layers of an LLM, instead of appending visual tokens to the input sequence as traditional approaches do. It performs referring image segmentation efficiently while training only about 2.5% of the parameters.

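Tags: multimodal learning, large language models, parameter-efficient fine-tuning, CLIP, LLaMA, referring image segmentation, vision-language models, attention mechanism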

Published 2026-06-09 13:40 | Recent activity 2026-06-09 13:50 | Estimated read 7 min


Section 01

MiT: Guide to the New Efficient Fine-Tuning Method for Multimodal Models Without Adding Visual Tokens

Title: MiT: A New Efficient Fine-Tuning Method for Multimodal Large Models Without Adding Visual Tokens
Core Idea: MiT injects visual features directly into the internal computation layers of the LLM, replacing the traditional method of adding visual tokens. It handles referring image segmentation while training only about 2.5% of the parameters.
Advantages: The input sequence does not get longer, so visual input adds no quadratic self-attention cost; the LLM and the visual encoder stay frozen; training is parameter-efficient.
Source: GitHub project (author kiva12138, published on 2026-06-09, link: https://github.com/kiva12138/MiT)

Section 02

Efficiency Dilemma of Multimodal Large Models

As LLMs have grown more capable, extending them to other modalities has become an active research area. Traditional methods concatenate the visual encoder's outputs to the text sequence as additional tokens, which raises three efficiency problems: self-attention cost grows quadratically with sequence length, so every added visual token is expensive; high-resolution images and multi-frame videos sharply increase computation and memory costs; and fully fine-tuning a large model requires resources that most researchers do not have. The key problem is therefore how to inject multimodal information efficiently while keeping the LLM frozen.
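
The quadratic growth described above can be made concrete with a little arithmetic. The sketch below only counts query-key pairs in one self-attention head; the function names and the token counts (a 32-token expression, 576 visual tokens) are illustrative assumptions, not figures from the MiT project.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of query-key pairs one self-attention head scores for a sequence."""
    return seq_len * seq_len


def concat_overhead(text_tokens: int, visual_tokens: int) -> float:
    """Ratio of score entries with visual tokens concatenated vs. text only."""
    with_visual = attention_score_entries(text_tokens + visual_tokens)
    text_only = attention_score_entries(text_tokens)
    return with_visual / text_only


# Hypothetical sizes: a 32-token referring expression plus 576 visual tokens
# (for example a 24 x 24 patch grid). Both numbers are assumptions.
ratio = concat_overhead(text_tokens=32, visual_tokens=576)
print(f"attention-score entries grow {ratio:.0f}x")  # prints "... grow 361x"
```

With these assumed sizes the sequence becomes 19 times longer, so the attention-score matrix becomes 19 x 19 = 361 times larger. An approach that leaves the sequence length unchanged avoids this factor entirely.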

Section 03

Core Idea of MiT: Information Infusion Instead of Token Concatenation

The core idea of MiT (Multimodal Infusion Tuning) is to directly inject visual features into the internal computation layers of LLMs instead of converting them into tokens for concatenation. Its advantages include:

The sequence length does not grow, so visual input adds no quadratic self-attention cost;
The base LLM (e.g., LLaMA) and the visual encoder (e.g., CLIP) stay fully frozen, and only lightweight infusion modules are trained;
Training is parameter-efficient: only about 2.5% of the parameters need to be trained.

The method has been validated on referring image segmentation, the task of segmenting the image region that a text description refers to.
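
To get a feel for what "about 2.5%" means in absolute terms, the following sketch converts a trainable fraction into a parameter budget. The source does not say which total the percentage refers to, so the 7-billion frozen-parameter figure below is a nominal assumption, and both helper names are hypothetical.

```python
def trainable_fraction(trainable_params: int, frozen_params: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return trainable_params / (trainable_params + frozen_params)


def trainable_budget(frozen_params: int, fraction: float) -> int:
    """Trainable count t such that t / (t + frozen_params) equals `fraction`."""
    return round(frozen_params * fraction / (1.0 - fraction))


# Assumption: a nominal 7-billion-parameter frozen backbone. The real
# denominator used by the project may differ.
FROZEN = 7_000_000_000
budget = trainable_budget(FROZEN, fraction=0.025)
print(f"{budget:,} trainable parameters")
print(f"fraction check: {trainable_fraction(budget, FROZEN):.4f}")
```

Under that assumption the trainable modules would hold on the order of a couple of hundred million parameters, while everything else stays fixed.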

Section 04

Technical Details: Three-Layer Infusion Mechanism

MiT designs a three-layer infusion mechanism that linearly injects CLIP's global image features into selected layers of LLaMA:

Key-Value (K/V) Infusion: Maps image features into the text space via multiplicative and additive transformations, then fuses them element-wise with the text Keys/Values to softly modulate the text representations;
Adaptive Head-Level Rescaling: Introduces learnable head-level vectors, combines them with the cosine similarity between the text Value and the image features, and applies sigmoid gating to adaptively control how much visual information is infused;
Feed-Forward Network (FFN) Infusion: Modulates the hidden states via a gating mechanism, which influences the model's nonlinear transformation.
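
The three components above can be sketched in plain Python on single vectors. This is a minimal illustration under stated assumptions, not the project's code: the exact formulas (the `1 + scale` form, a scalar weight per head, a residual FFN update) and all function names are hypothetical, and a real implementation would operate on batched tensors inside the attention and FFN layers.

```python
import math
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Dense layer; `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def head_gate(text_value: Vector, image_feat: Vector, head_weight: float) -> float:
    """Gate in (0, 1): learnable per-head weight times cosine similarity, then sigmoid."""
    return sigmoid(head_weight * cosine(text_value, image_feat))


def kv_infuse(text_kv: Vector, image_feat: Vector,
              w_scale: List[Vector], b_scale: Vector,
              w_shift: List[Vector], b_shift: Vector,
              gate: float = 1.0) -> Vector:
    """Element-wise multiplicative and additive modulation of a text Key or Value.

    Assumed form: text * (1 + gate * scale) + gate * shift, so gate = 0
    returns the frozen model's vector unchanged.
    """
    scale = linear(image_feat, w_scale, b_scale)
    shift = linear(image_feat, w_shift, b_shift)
    return [t * (1.0 + gate * s) + gate * a for t, s, a in zip(text_kv, scale, shift)]


def ffn_infuse(hidden: Vector, image_feat: Vector, gate_logit: float) -> Vector:
    """Assumed gated residual update of a hidden state with the image feature."""
    g = sigmoid(gate_logit)
    return [h + g * v for h, v in zip(hidden, image_feat)]


# Toy 2-dimensional example with identity projections (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
image_feat = [0.6, 0.8]   # stand-in for a global image feature in text space
text_value = [1.0, 0.0]

gate = head_gate(text_value, image_feat, head_weight=2.0)
infused_value = kv_infuse(text_value, image_feat, identity, zeros, identity, zeros, gate)
infused_hidden = ffn_infuse([0.2, -0.1], image_feat, gate_logit=0.0)
print(gate, infused_value, infused_hidden)
```

Note that none of these operations add positions to the sequence: the image feature only rescales and shifts vectors that already exist for the text tokens.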

Section 05

Architecture Design and Implementation Details

Architecture Design:

Frozen base models: LLaMA-2-7B and CLIP-ViT-Large are fully frozen to retain pre-trained knowledge;
Lightweight modules: The trainable part consists only of a few linear transformations and head-level parameters;
Last token pooling: Takes the hidden state of the LLM's last token as the infused text representation;
Lightweight segmentation decoder: Combines multi-level CLIP feature maps to generate segmentation masks.
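
Last token pooling is simple to illustrate. In a causal decoder the last real token is the only position that attends to the whole expression, which is why its hidden state can summarize the sequence. The sketch below assumes a padded sequence with a 0/1 attention mask; the function name is hypothetical and the project's `Model.py` may handle padding differently.

```python
from typing import List, Sequence


def last_token_pool(hidden_states: Sequence[Sequence[float]],
                    attention_mask: Sequence[int]) -> List[float]:
    """Return the hidden state of the last non-padding token (mask value 1)."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have the same length")
    # Scan from the end so the same code works for left and right padding.
    for index in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[index] == 1:
            return list(hidden_states[index])
    raise ValueError("attention_mask marks no real tokens")


# Three positions, the last one is padding.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
mask = [1, 1, 0]
print(last_token_pool(hidden, mask))  # prints [0.3, 0.4]
```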

Implementation Details: The code is modular and includes Model.py (core model), DecoderTF.py (default segmentation decoder), ReferDataset.py (dataset loading), and other files. It targets transformers 4.35.x and rewrites the LLaMA attention logic to support the infusion mechanism.
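
Because the project rewrites library attention logic, it is tied to the internals of a specific transformers release series. A small guard like the one below can flag a mismatched environment before training starts. This is a hedged sketch using only the standard library; both helper names are hypothetical and are not part of the MiT repository.

```python
from importlib.metadata import PackageNotFoundError, version as installed_version


def is_supported_transformers(version: str, series: str = "4.35") -> bool:
    """True when `version` belongs to the given major.minor series, e.g. 4.35.2."""
    parts = version.split(".")
    return ".".join(parts[:2]) == series


def check_environment() -> str:
    """Describe whether the installed transformers release matches 4.35.x."""
    try:
        found = installed_version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if is_supported_transformers(found):
        return f"transformers {found} matches the 4.35.x series"
    return (f"transformers {found} is outside the 4.35.x series; "
            "the rewritten attention logic may not apply cleanly")


print(check_environment())
```

Pinning the dependency in the environment file achieves the same goal; the runtime check simply makes the failure mode explicit.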

Section 06

Experimental Validation and Dataset Support

MiT has been validated on multiple referring image segmentation datasets:

RefCOCO (19994 images, 142210 referring expressions);
RefCOCO+ (19992 images, 141564 referring expressions);
RefCOCOg (25799 images, 95010 referring expressions);
RefCLEF (based on the SAIAPR TC-12 image set).

The project provides one-click download scripts and data validation tools to lower the barrier to reproduction.
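
The dataset figures listed above can be compared by their density of referring expressions per image. The short sketch below uses only the counts quoted in this section; the dictionary and function names are hypothetical.

```python
# (images, referring expressions) as listed above.
DATASETS = {
    "RefCOCO": (19994, 142210),
    "RefCOCO+": (19992, 141564),
    "RefCOCOg": (25799, 95010),
}


def expressions_per_image(images: int, expressions: int) -> float:
    """Average number of referring expressions attached to each image."""
    return expressions / images


for name, (images, expressions) in DATASETS.items():
    density = expressions_per_image(images, expressions)
    print(f"{name}: {density:.2f} expressions per image")
```

By these counts RefCOCO and RefCOCO+ carry roughly seven expressions per image, while RefCOCOg carries fewer than four.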

Section 07

Technical Insights and Future Outlook

Technical Insights:

Internal infusion is presented as more efficient and flexible than external token concatenation;
Freezing the base models is feasible: new capabilities can be added through lightweight adapter-style modules;
Different tasks may call for different infusion strategies, and the framework is designed to be extensible.

Future Outlook: Extend the approach to more modalities such as audio and video; apply it to tasks such as visual question answering and image captioning; and refine the infusion modules to reduce parameters and improve interpretability.
