Section 01
[Introduction] The ML-FOP-SOAP Framework Solves Modality Competition in Multimodal Models
This paper addresses modality competition in the training of unified multimodal models and proposes the ML-FOP-SOAP optimization framework. The framework uses Fisher orthogonal projection to suppress the conflicts that arise from cross-modal gradient heterogeneity. It is validated on the Janus and Emu3 models, where it supports stable training at a batch size of 8192, improves sample efficiency by 1.4x, speeds up training by 1.5x, and breaks the performance trade-off between the visual and text modalities.
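To make the idea of suppressing gradient conflicts by projection concrete, the following is a minimal sketch and not the paper's algorithm. This section does not specify how the projection is computed, so the sketch assumes a diagonal Fisher approximation and borrows a PCGrad-style rule: a modality's gradient is projected only when it conflicts with the other modality's gradient. The names `fisher_inner`, `project_if_conflicting`, `g_text`, `g_image`, and `fisher_diag` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def fisher_inner(u: Sequence[float], v: Sequence[float],
                 fisher_diag: Sequence[float]) -> float:
    """Inner product <u, v>_F = sum_i u_i * F_ii * v_i.

    Assumes a diagonal Fisher approximation (an assumption of this sketch).
    """
    return sum(ui * fi * vi for ui, fi, vi in zip(u, fisher_diag, v))


def project_if_conflicting(g_a: Sequence[float], g_b: Sequence[float],
                           fisher_diag: Sequence[float],
                           eps: float = 1e-12) -> List[float]:
    """Remove from g_a its component along g_b under the Fisher inner product.

    The projection is applied only when the two gradients conflict, that is,
    when their Fisher inner product is negative (PCGrad-style rule, assumed).
    """
    inner_ab = fisher_inner(g_a, g_b, fisher_diag)
    if inner_ab >= 0.0:
        # No conflict: leave the gradient unchanged.
        return list(g_a)
    scale = inner_ab / (fisher_inner(g_b, g_b, fisher_diag) + eps)
    return [a - scale * b for a, b in zip(g_a, g_b)]


# Toy two-parameter example with made-up values.
g_text = [1.0, -2.0]      # hypothetical text-modality gradient
g_image = [0.0, 1.0]      # hypothetical visual-modality gradient
fisher_diag = [1.0, 2.0]  # hypothetical diagonal Fisher estimate

g_text_projected = project_if_conflicting(g_text, g_image, fisher_diag)
```

In this toy case the two gradients conflict under the Fisher inner product, so the text gradient loses its component along the visual gradient and the two become orthogonal in that metric. Whether ML-FOP-SOAP uses a diagonal, Kronecker-factored, or other Fisher estimate is not stated in this section.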