Section 01
[Introduction] AnisoAlign Framework: A New Approach to Resolving Modality Gaps in Multimodal Representation Spaces
This article introduces AnisoAlign, a framework that targets the modality gap problem in multimodal large language model training. A geometric analysis shows that the modality gap is not a simple global shift but an anisotropic residual structure concentrated in a few dominant directions. Building on this finding, the framework proposes an anisotropic alignment principle and a bounded correction method. Together these improve the performance of multimodal models trained with single-modality data and offer one way to alleviate the scarcity of multimodal data.
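To make the distinction between a global shift and an anisotropic residual concrete, the following is a minimal diagnostic sketch. It is not the AnisoAlign method itself. The function name `gap_spectrum`, the synthetic embeddings, and all dimensions and scales are assumptions chosen for illustration. The sketch subtracts the mean gap between paired embeddings (the global shift) and then measures how much of the remaining residual energy falls in the top few singular directions.

```python
import numpy as np

def gap_spectrum(img_emb, txt_emb, k=3):
    """Hypothetical helper: split the paired gap into a mean shift and a
    centered residual, then return the share of residual energy carried
    by the top-k singular directions."""
    gap = img_emb - txt_emb            # per-pair gap vectors, shape (n, d)
    mean_shift = gap.mean(axis=0)      # the "global shift" component
    residual = gap - mean_shift        # what is left once the shift is removed
    # Squared singular values give the energy along each residual direction.
    s = np.linalg.svd(residual, compute_uv=False)
    energy = s ** 2
    top_k_ratio = float(energy[:k].sum() / energy.sum())
    return mean_shift, top_k_ratio

# Synthetic data (an assumption, not real model embeddings).
rng = np.random.default_rng(0)
n, d, k = 500, 32, 3
txt = rng.normal(size=(n, d))
shift = np.full(d, 0.5)                              # constant offset
noise = rng.normal(scale=0.1, size=(n, d))           # small isotropic noise

# Anisotropic case: residual lives mostly in k orthonormal directions.
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))     # shape (d, k)
coeffs = rng.normal(scale=5.0, size=(n, k))
img_aniso = txt + shift + coeffs @ basis.T + noise

# Contrast case: pure shift plus isotropic noise, no dominant directions.
img_iso = txt + shift + noise

mean_shift, aniso_ratio = gap_spectrum(img_aniso, txt, k=k)
_, iso_ratio = gap_spectrum(img_iso, txt, k=k)

print(f"top-{k} energy share, anisotropic gap: {aniso_ratio:.3f}")
print(f"top-{k} energy share, isotropic gap:   {iso_ratio:.3f}")
```

In the anisotropic case nearly all residual energy sits in the top three directions, while in the isotropic case the share stays close to the uniform baseline of k/d. Under these assumptions, subtracting a single mean vector would remove the shift in both cases but leave the dominant-direction structure untouched, which is the situation the summary describes.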