Zing Forum


DiM³: No Retraining Needed—Endowing Multimodal Models with Multilingual Capabilities via Direction and Magnitude-Aware Merging

DiM³ proposes a training-free method that injects multilingual capabilities into multimodal models across 57 languages by selectively merging multilingual and multimodal parameter updates, achieving performance comparable to dedicated multilingual multimodal fine-tuning.

Tags: Multimodal Models · Multilingual Models · Parameter Merging · Model Fusion · Training-Free Methods · Cross-Lingual Alignment · LLaVA · Qwen
Published 2026-05-13 11:50 · Recent activity 2026-05-14 10:48 · Estimated read 4 min

Section 01

DiM³: An Innovative Method to Endow Multimodal Models with Multilingual Capabilities Without Retraining

DiM³ proposes a training-free method that injects multilingual capabilities into multimodal models across 57 languages via direction- and magnitude-aware parameter merging. Its performance is comparable to dedicated multilingual multimodal fine-tuning, eliminating the high cost of the traditional approach to integrating multilingual and multimodal capabilities.


Section 02

Traditional Challenges in Integrating Multilingual Multimodal Models

Current large multimodal models (e.g., LLaVA, Qwen-VL) excel in visual understanding but are primarily designed for English users. Traditionally, endowing them with multilingual capabilities requires building large-scale multilingual multimodal datasets and end-to-end retraining, which is costly. Additionally, multilingual and multimodal updates conflict in shared language backbones, and simple merging often leads to performance degradation.


Section 03

Core of DiM³: Direction and Magnitude-Aware Selective Parameter Merging

DiM³ frames the problem as selective merging in parameter space and analyzes the geometric properties of the multilingual and multimodal updates. Direction awareness identifies complementary and conflicting parameter dimensions, reinforcing the updates where they align and balancing them where they conflict. Magnitude awareness evaluates parameter sensitivity to weigh the importance of each capability. Together, these achieve coverage of 57 languages while preserving multimodal abilities.
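The direction- and magnitude-aware rule described above can be sketched on task vectors (parameter deltas from a shared base model). This is an illustrative reconstruction, not the paper's exact formulation: the sign test for direction awareness and the relative-magnitude weighting are assumptions.

```python
import numpy as np

def dim3_style_merge(base, mm, lang):
    """Illustrative direction- and magnitude-aware merge.

    base, lang, mm: flat parameter arrays of the shared base LLM, the
    multilingual fine-tune, and the multimodal fine-tune (same shape).
    NOTE: this weighting scheme is a plausible sketch of the idea, not
    the paper's exact method.
    """
    d_mm = mm - base      # multimodal task vector
    d_lang = lang - base  # multilingual task vector

    # Direction awareness: per-coordinate sign agreement.
    aligned = np.sign(d_mm) == np.sign(d_lang)

    # Magnitude awareness: weight each coordinate by relative update
    # size, so the more sensitive capability dominates under conflict.
    total = np.abs(d_mm) + np.abs(d_lang) + 1e-12
    w_mm = np.abs(d_mm) / total

    merged_delta = np.where(
        aligned,
        d_mm + d_lang,                       # complementary: reinforce
        w_mm * d_mm + (1.0 - w_mm) * d_lang, # conflicting: balance
    )
    return base + merged_delta
```

On aligned coordinates the two updates add up; on conflicting ones the larger update pulls the merged weight toward its own direction instead of the two simply cancelling.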


Section 04

Technical Implementation of DiM³: Freeze Visual Components, Merge Only Language Backbone

DiM³ keeps the visual encoder and multimodal projector of the original multimodal model unchanged, merging only the parameters of the language model backbone. This strategy avoids disrupting the learned vision-language alignment, simplifies the merging process, and enhances the method's generality and transferability.
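A minimal sketch of this freeze-and-merge strategy over a checkpoint state dict. The key prefixes (`vision_tower.`, `mm_projector.`) are illustrative assumptions modeled on LLaVA-style naming; real checkpoints may differ.

```python
def merge_language_backbone_only(mm_state, lang_state, merge_fn):
    """Merge only the LLM-backbone weights; keep the vision encoder and
    multimodal projector from the multimodal model untouched.

    mm_state:   state dict of the multimodal model (name -> weights)
    lang_state: state dict of the multilingual model (backbone names only)
    merge_fn:   per-tensor merge rule, e.g. a direction/magnitude-aware merge
    NOTE: prefixes below are hypothetical, not guaranteed checkpoint keys.
    """
    FROZEN_PREFIXES = ("vision_tower.", "mm_projector.")
    merged = {}
    for name, w_mm in mm_state.items():
        if name.startswith(FROZEN_PREFIXES):
            merged[name] = w_mm  # preserve learned vision-language alignment
        elif name in lang_state:
            merged[name] = merge_fn(w_mm, lang_state[name])  # backbone merge
        else:
            merged[name] = w_mm  # weights absent from the multilingual model
    return merged
```

Because only backbone tensors are touched, the same routine transfers across architectures whose visual components differ, which is what makes the strategy general.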


Section 05

Experimental Validation of DiM³: Performance Comparable to Dedicated Fine-Tuning

Validated on LLaVA and Qwen architectures across 57 languages: on text tasks, the merged models significantly outperform the originals; on vision-language tasks, they demonstrate cross-lingual capabilities; overall performance is comparable to dedicated fine-tuned models while the original multimodal abilities are preserved.


Section 06

Interpretability and Practical Application Value of DiM³

Interpretability analysis shows that DiM³ primarily affects the middle layers of the language model (reshaping semantic representations), while the top layers retain task structures (preserving multimodal capabilities), yielding a unified cross-lingual, multimodal representation. In practice, it can quickly extend the language coverage of existing models, and the code has been open-sourced.


Section 07

Limitations and Future Directions of DiM³

Limitations: effectiveness is limited for low-resource languages. Future directions: explore merging strategies tailored to low-resource languages, extend to additional modalities such as audio and video, and study dynamic merging strategies.