Modality Competition in Multimodal Models: A Multi-Level Variance Correction Method Based on Second-Order Optimization

This paper proposes ML-FOP-SOAP, an optimization framework that uses Fisher orthogonal projection to suppress the modality conflicts caused by cross-modal gradient heterogeneity. In experiments on Janus and Emu3, the method trains stably at a batch size of 8192, improves sample efficiency by 1.4x, and speeds up training by 1.5x.

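Tags: ML-FOP-SOAP, second-order optimization, multimodal models, modality competition, SOAP, Fisher orthogonal projection, large-scale training, unified multimodal
Published 2026-05-16 00:45 | Recent activity 2026-05-18 16:23 | Estimated read 6 min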

Section 01

[Introduction] The ML-FOP-SOAP Framework Solves Modality Competition in Multimodal Models

This paper addresses modality competition in the training of unified multimodal models and proposes the ML-FOP-SOAP optimization framework. The framework uses Fisher orthogonal projection to suppress conflicts caused by cross-modal gradient heterogeneity. It is evaluated on the Janus and Emu3 models, where it trains stably at a batch size of 8192, improves sample efficiency by 1.4x, speeds up training by 1.5x, and improves the visual and text modalities together rather than trading one off against the other.

Section 02

Research Background: Optimization Challenges of Unified Multimodal Models

Autoregressive next-token prediction provides a single training objective for both image generation and text understanding. Models such as Janus and Emu3 have shown the potential of this approach, but they also suffer from modality competition: visual and text gradient updates conflict during training. The symptoms are loss oscillations, opposing gradient directions, hyperparameter sensitivity, and collapse under large-batch training, all of which limit large-scale training.

Section 03

Root Cause of the Problem: Limitations of First-Order Optimizers

The root cause of modality competition is that first-order optimizers (e.g., AdamW) are vulnerable to cross-modal gradient heterogeneity. This heterogeneity shows up in three ways: visual gradients have large magnitudes (due to high-dimensional outputs) while text gradients are small; the two gradients often point in nearly opposite directions (angle close to 180 degrees); and the two modalities have different curvature properties (different Hessian spectra). AdamW uses only first-order gradient information: it rescales each parameter independently using running moments of the gradient, ignores correlations between parameters, and is sensitive to gradient noise, so it cannot handle this heterogeneity effectively.
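The magnitude and direction mismatch described above can be measured directly. The following is a minimal diagnostic sketch, not code from the paper: the function name `heterogeneity_stats` and the toy gradient values are assumptions made for illustration, and in practice the two vectors would be the flattened per-modality gradients of the shared parameters.

```python
import numpy as np

def heterogeneity_stats(g_visual, g_text, eps=1e-12):
    """Return (norm ratio, cosine similarity, angle in degrees) for two gradients."""
    g_visual = np.asarray(g_visual, dtype=float).ravel()
    g_text = np.asarray(g_text, dtype=float).ravel()
    n_v = np.linalg.norm(g_visual)
    n_t = np.linalg.norm(g_text)
    # Cosine similarity, clipped so arccos stays well defined under rounding.
    cos = float(g_visual @ g_text / (n_v * n_t + eps))
    cos = max(-1.0, min(1.0, cos))
    angle = float(np.degrees(np.arccos(cos)))
    return float(n_v / (n_t + eps)), cos, angle

# Toy values (made up): a large visual gradient and a small,
# nearly opposed text gradient.
g_v = np.array([10.0, 0.0, 5.0])
g_t = np.array([-1.0, 0.1, -0.5])
ratio, cos, angle = heterogeneity_stats(g_v, g_t)
print(f"norm ratio={ratio:.1f}, cosine={cos:.3f}, angle={angle:.1f} deg")
```

A large norm ratio combined with a cosine near -1 is the situation the section describes: the summed gradient is dominated by the visual term and works against the text objective.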

Section 04

Method Foundation: Advantages of Second-Order Preconditioning SOAP

Second-order preconditioning (e.g., SOAP) provides a stable foundation for multimodal alignment. SOAP builds on Shampoo preconditioning: it runs Adam-style adaptive updates with momentum in the eigenbasis of Shampoo's Kronecker-factored preconditioner, and it has performed well in single-modality training. Compared with first-order methods, second-order methods can account for curvature differences, correct update directions, and are more robust to magnitude differences. Applied directly, however, SOAP does not resolve modality competition by itself; additional design is needed.
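To make the idea of preconditioning in a rotated basis concrete, here is a simplified sketch of a SOAP-style update for one weight matrix. It is an illustration under stated assumptions, not the reference SOAP implementation and not the paper's code: the names `init_state` and `soap_like_step` are hypothetical, the eigenbases are recomputed at every step (practical implementations refresh them only periodically), and weight decay is omitted.

```python
import numpy as np

def init_state(shape):
    m, n = shape
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "m": np.zeros(shape), "v": np.zeros(shape), "t": 0}

def soap_like_step(W, G, state, lr=0.05, beta1=0.9, beta2=0.999,
                   shampoo_beta=0.95, eps=1e-8):
    """One simplified SOAP-style step for a matrix parameter W with gradient G."""
    state["t"] += 1
    t = state["t"]
    # Shampoo-style running statistics of the left and right Kronecker factors.
    state["L"] = shampoo_beta * state["L"] + (1 - shampoo_beta) * (G @ G.T)
    state["R"] = shampoo_beta * state["R"] + (1 - shampoo_beta) * (G.T @ G)
    # Eigenbases of the two factors.
    _, QL = np.linalg.eigh(state["L"])
    _, QR = np.linalg.eigh(state["R"])
    # First moment is tracked in the original space, then rotated.
    state["m"] = beta1 * state["m"] + (1 - beta1) * G
    G_rot = QL.T @ G @ QR
    m_rot = QL.T @ state["m"] @ QR
    # Second moment is tracked in the rotated space (Adam in the eigenbasis).
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rot ** 2
    m_hat = m_rot / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    # Rotate the normalized update back to the original space.
    return W - lr * (QL @ (m_hat / (np.sqrt(v_hat) + eps)) @ QR.T)

# Toy quadratic objective 0.5 * ||W - target||^2, whose gradient is W - target.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 3))
W = np.zeros((4, 3))
state = init_state(W.shape)
loss_before = 0.5 * np.sum((W - target) ** 2)
for _ in range(10):
    W = soap_like_step(W, W - target, state)
loss_after = 0.5 * np.sum((W - target) ** 2)
print(loss_before, loss_after)
```

The difference from AdamW is the rotation: the per-coordinate scaling happens in a basis aligned with the gradient's row and column statistics instead of the raw parameter axes, which is what lets the update reflect correlations between parameters.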

Section 05

ML-FOP-SOAP Framework: Core Design and Strategies

ML-FOP-SOAP is a second-order optimization framework designed specifically for multimodal models:
1. Core innovation: Fisher orthogonal projection, which decomposes gradients into modality-shared and modality-specific components to suppress conflicts.
2. Multi-level variance correction: global (dynamically adjusts modality weights), layer-level (independent correction per layer), and head-level (attention head correction).
3. Hierarchical folding strategy: micro-step incremental correction that keeps overhead below 15% and supports large-batch training.
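The summary does not give the projection formulas, so the following is only a sketch of what a Fisher-metric orthogonal decomposition could look like. It assumes a diagonal Fisher estimate and a PCGrad-style rule (remove the conflicting component only when the two gradients conflict); both are assumptions, and the names `fisher_inner`, `fisher_split`, and `resolve_conflict` are hypothetical.

```python
import numpy as np

def fisher_inner(a, b, fisher_diag):
    """Inner product <a, b>_F = a^T diag(F) b under a diagonal Fisher metric."""
    return float(np.sum(a * fisher_diag * b))

def fisher_split(g, ref, fisher_diag, eps=1e-12):
    """Split g into a part along ref and a remainder Fisher-orthogonal to ref."""
    coef = fisher_inner(g, ref, fisher_diag) / (fisher_inner(ref, ref, fisher_diag) + eps)
    parallel = coef * ref
    return parallel, g - parallel

def resolve_conflict(g_visual, g_text, fisher_diag):
    """If the gradients conflict under the Fisher metric, keep only the
    component of each that is Fisher-orthogonal to the other."""
    if fisher_inner(g_visual, g_text, fisher_diag) >= 0:
        return g_visual, g_text  # no conflict: leave both untouched
    _, g_visual_proj = fisher_split(g_visual, g_text, fisher_diag)
    _, g_text_proj = fisher_split(g_text, g_visual, fisher_diag)
    return g_visual_proj, g_text_proj

# Toy values (made up). The Fisher diagonal would normally be estimated,
# for example from a running mean of squared gradients.
fisher_diag = np.array([1.0, 0.5, 2.0])
g_v = np.array([4.0, 0.0, 2.0])
g_t = np.array([-1.0, 0.5, 0.2])
g_v_proj, g_t_proj = resolve_conflict(g_v, g_t, fisher_diag)
combined = g_v_proj + g_t_proj
print(combined)
```

With two modalities, the combined update has a non-negative Fisher inner product with each original gradient, so under this metric it no longer moves against either objective.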

Section 06

Experimental Verification: Performance and Stability Improvements

The method is evaluated on Janus-1.3B and Emu3-8B. Compared with baselines such as AdamW, ML-FOP-SOAP reduces visual FID by 20%, increases text accuracy by 4.3%, and achieves 1.4x sample efficiency and 1.5x training speed. At a batch size of 8192, AdamW diverges while ML-FOP-SOAP converges stably. Ablation experiments indicate that Fisher projection, multi-level correction, and hierarchical folding are each necessary.

Section 07

Technical Contributions and Practical Value

Theoretical contributions: the paper quantifies cross-modal gradient heterogeneity, argues for the advantages of second-order methods, and provides a Fisher-geometric interpretation. Practical value: lower training costs (40% higher sample efficiency, 50% faster training, support for large batches) and better model quality (both modalities improve, and training is stable). The team plans to open-source the PyTorch implementation, pre-training configurations, and training logs.
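As a rough reading of those ratios, assume that "1.4x sample efficiency" means reaching the same quality with 1/1.4 of the samples and that "1.5x training speed" means reaching it in 1/1.5 of the wall-clock time. The summary does not define either term precisely, so this interpretation is an assumption. Under it:

```latex
N_{\text{new}} = \frac{N_{\text{base}}}{1.4} \approx 0.71\, N_{\text{base}}
\quad (\text{about } 29\% \text{ fewer samples}),
\qquad
T_{\text{new}} = \frac{T_{\text{base}}}{1.5} \approx 0.67\, T_{\text{base}}
\quad (\text{about } 33\% \text{ less time}).
```

So "40% higher sample efficiency" and "50% faster" correspond to savings of roughly 29% and 33%, not 40% and 50%.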

Section 08

Limitations and Future Directions

Current limitations: second-order methods carry high computational overhead and large memory requirements, and the approach has only been verified on autoregressive models. Future directions: extending to audio and video modalities, combining with mixed-precision training, making the multi-level correction adaptive, and optimizing communication efficiency in distributed training.