Section 01
MiT: Guide to the New Efficient Fine-Tuning Method for Multimodal Models Without Adding Visual Tokens
Title: MiT: A New Efficient Fine-Tuning Method for Multimodal Large Models Without Adding Visual Tokens
Core Idea: MiT proposes a multimodal fusion method that injects visual features directly into the internal computation layers of the LLM, instead of the conventional approach of appending visual tokens to the input sequence. It performs referring image segmentation efficiently while training only 2.5% of the parameters.
Advantages: Avoids sequence-length expansion (and the quadratic attention overhead that comes with it), keeps the LLM and visual encoder frozen, and is parameter-efficient.
Source: GitHub project (author kiva12138, published on 2026-06-09, link: https://github.com/kiva12138/MiT)
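To make the sequence-length point concrete, here is a rough back-of-the-envelope sketch of how self-attention cost grows when visual tokens are appended to the text sequence. It is not taken from the MiT repository: the function names and the token counts (64 text tokens, 576 visual tokens) are illustrative assumptions, and it counts only query-key pairs in self-attention, ignoring whatever cost the feature injection itself adds.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored in one full self-attention pass."""
    return seq_len * seq_len


def relative_cost(text_tokens: int, visual_tokens: int) -> float:
    """Attention cost with appended visual tokens, relative to text only."""
    extended = attention_pairs(text_tokens + visual_tokens)
    baseline = attention_pairs(text_tokens)
    return extended / baseline


# Hypothetical sizes chosen for illustration only.
text_tokens = 64
visual_tokens = 576

ratio = relative_cost(text_tokens, visual_tokens)
print(f"Appending {visual_tokens} visual tokens to {text_tokens} text tokens "
      f"multiplies attention pairs by {ratio:.0f}x")
```

Under these assumed sizes the sequence grows 10x, so the number of attention pairs grows 100x. A method that leaves the sequence at its text-only length avoids that multiplier, which is the advantage the summary describes.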