# DiM³: No Retraining Needed—Endowing Multimodal Models with Multilingual Capabilities via Direction and Magnitude-Aware Merging

> DiM³ proposes a training-free method that injects multilingual capabilities into multimodal models across 57 languages by selectively merging multilingual and multimodal parameter updates, achieving performance comparable to dedicated multilingual multimodal fine-tuning.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-13T03:50:54.000Z
- Last activity: 2026-05-14T02:48:42.179Z
- Popularity: 128.0
- Keywords: multimodal models, multilingual models, parameter merging, model merging, training-free methods, cross-lingual alignment, LLaVA, Qwen
- Page link: https://www.zingnex.cn/en/forum/thread/dim3
- Canonical: https://www.zingnex.cn/forum/thread/dim3
- Markdown source: floors_fallback

---

## DiM³: An Innovative Method to Endow Multimodal Models with Multilingual Capabilities Without Retraining

DiM³ is a training-free method that injects multilingual capabilities into multimodal models across 57 languages via direction- and magnitude-aware parameter merging. It performs comparably to dedicated multilingual multimodal fine-tuning while avoiding the high cost of traditional approaches to integrating multilingual and multimodal capabilities.

## Traditional Challenges in Integrating Multilingual Multimodal Models

Current large multimodal models (e.g., LLaVA, Qwen-VL) excel in visual understanding but are primarily designed for English users. Traditionally, endowing them with multilingual capabilities requires building large-scale multilingual multimodal datasets and end-to-end retraining, which is costly. Additionally, multilingual and multimodal updates conflict in shared language backbones, and simple merging often leads to performance degradation.

## Core of DiM³: Direction and Magnitude-Aware Selective Parameter Merging

DiM³ frames the problem as selective merging in parameter space and analyzes the geometric properties of the multilingual and multimodal parameter updates. Direction awareness identifies complementary versus conflicting dimensions: aligned updates reinforce each other, while conflicting ones are arbitrated rather than naively summed. Magnitude awareness evaluates parameter sensitivity to weigh the importance of each capability per dimension. Together, these mechanisms extend coverage to 57 languages while balancing multimodal abilities.
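The idea can be illustrated with a minimal sketch. The sign test for direction agreement and the relative-magnitude weighting below are simplified stand-ins for the paper's actual direction- and magnitude-aware criteria, which are not specified in this summary:

```python
import numpy as np

def dim_merge(base, delta_lang, delta_mm):
    """Illustrative direction/magnitude-aware merge of two updates.

    base       -- pretrained language-backbone weights
    delta_lang -- multilingual update (multilingual model minus base)
    delta_mm   -- multimodal update (multimodal model minus base)
    """
    # Direction awareness: dimensions where the two updates agree in
    # sign are complementary; where they disagree, they conflict.
    aligned = np.sign(delta_lang) == np.sign(delta_mm)

    # Magnitude awareness: per-dimension weight from relative update
    # size, a crude proxy for parameter-sensitivity estimation.
    w = np.abs(delta_lang) / (np.abs(delta_lang) + np.abs(delta_mm) + 1e-8)

    # Aligned dimensions: add both updates (reinforce).
    # Conflicting dimensions: magnitude-weighted interpolation (arbitrate),
    # so the more sensitive update dominates instead of both cancelling out.
    merged = np.where(aligned,
                      delta_lang + delta_mm,
                      w * delta_lang + (1.0 - w) * delta_mm)
    return base + merged
```

Applied per weight tensor, this keeps complementary knowledge from both updates while preventing conflicting dimensions from degrading either capability.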

## Technical Implementation of DiM³: Freeze Visual Components, Merge Only Language Backbone

DiM³ keeps the visual encoder and multimodal projector of the original multimodal model unchanged, merging only the parameters of the language model backbone. This strategy avoids disrupting the learned vision-language alignment, simplifies the merging process, and enhances the method's generality and transferability.

## Experimental Validation of DiM³: Performance Comparable to Dedicated Fine-Tuning

Validated on LLaVA and Qwen architectures across 57 languages: on text tasks, the merged models significantly outperform the original multimodal models; on vision-language tasks, they demonstrate cross-lingual capabilities; overall performance is comparable to dedicated fine-tuned models while the original multimodal abilities are preserved.

## Interpretability and Practical Application Value of DiM³

Interpretability analysis shows that DiM³ primarily affects the middle layers of the language model, reshaping semantic representations, while the top layers retain their task structure, preserving multimodal capabilities. This yields a unified cross-lingual multimodal representation. In practice, the method can quickly extend the language coverage of existing models, and the code has been open-sourced.

## Limitations and Future Directions of DiM³

Limitations: effectiveness remains limited for low-resource languages. Future directions include exploring hybrid merging strategies for low-resource languages, extending the method to additional modalities such as audio and video, and studying dynamic merging strategies.
