# Modality Competition in Multimodal Models: A Multi-Level Variance Correction Method Based on Second-Order Optimization

> This paper proposes the ML-FOP-SOAP optimization framework, which uses Fisher orthogonal projection to suppress modality conflicts caused by cross-modal gradient heterogeneity. Experiments on Janus and Emu3 show that the method trains stably at a batch size of 8192, improves sample efficiency by 1.4x, and speeds up training by 1.5x.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-15T16:45:56.000Z
- Last activity: 2026-05-18T08:23:41.213Z
- Popularity: 96.4
- Keywords: ML-FOP-SOAP, second-order optimization, multimodal models, modality competition, SOAP, Fisher orthogonal projection, large-scale training, unified multimodal
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-16165v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-16165v1

---

## [Introduction] The ML-FOP-SOAP Framework Solves Modality Competition in Multimodal Models

This paper addresses modality competition in the training of unified multimodal models and proposes the ML-FOP-SOAP optimization framework. The framework uses Fisher orthogonal projection to suppress conflicts caused by cross-modal gradient heterogeneity. It is evaluated on the Janus and Emu3 models, where it trains stably at a batch size of 8192, improves sample efficiency by 1.4x, speeds up training by 1.5x, and improves both visual and text performance instead of trading one for the other.

## Research Background: Optimization Challenges of Unified Multimodal Models

Autoregressive next-token prediction provides a unified training framework for image generation and text understanding. Models such as Janus and Emu3 have shown the potential of this approach, but it also introduces modality competition: visual and text gradient updates conflict during training. The symptoms are loss oscillations, opposing gradient directions, hyperparameter sensitivity, and collapse under large-batch training, all of which limit large-scale training.

## Root Cause of the Problem: Limitations of First-Order Optimizers

The paper attributes modality competition to the vulnerability of first-order optimizers (e.g., AdamW) to cross-modal gradient heterogeneity. This heterogeneity shows up in three ways: visual gradients have large magnitudes (due to high-dimensional outputs) while text gradients are small; the two gradients often point in nearly opposite directions (angle close to 180 degrees); and their curvature properties differ (different Hessian spectra). AdamW uses only first-order gradient information: it rescales each parameter independently with running gradient moments (a diagonal preconditioner), ignores cross-parameter curvature, and is sensitive to gradient noise, so it cannot handle this heterogeneity effectively.
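
The first two symptoms (magnitude gap and opposing directions) can be measured directly from per-modality gradients. The following is a minimal sketch, not the paper's diagnostic: the function name `grad_heterogeneity` and the toy gradient values are made up for illustration, and the gradients are assumed to be already flattened into plain lists.

```python
import math

def grad_heterogeneity(g_vis, g_txt):
    """Return (norm_ratio, angle_degrees) for two flattened gradients."""
    dot = sum(a * b for a, b in zip(g_vis, g_txt))
    n_vis = math.sqrt(sum(a * a for a in g_vis))
    n_txt = math.sqrt(sum(b * b for b in g_txt))
    cos = dot / (n_vis * n_txt)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return n_vis / n_txt, math.degrees(math.acos(cos))

# Toy, made-up gradients: the visual gradient is roughly 10x larger
# and points almost opposite to the text gradient.
g_vis = [10.0, -8.0, 6.0]
g_txt = [-1.0, 0.9, -0.5]
ratio, angle = grad_heterogeneity(g_vis, g_txt)
print(f"norm ratio: {ratio:.2f}, angle: {angle:.1f} degrees")
```

A large norm ratio together with an angle near 180 degrees is the situation the section describes: summing the two gradients lets the visual term dominate and largely cancels the text update.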

## Method Foundation: Advantages of Second-Order Preconditioning SOAP

Second-order preconditioning (e.g., SOAP) provides a more stable foundation for multimodal alignment. SOAP builds on Shampoo's Kronecker-factored preconditioner and runs Adam-style adaptive updates in that preconditioner's eigenbasis; it performs well in single-modality training. Compared with first-order methods, second-order methods can account for curvature differences, correct update directions, and are more robust to magnitude differences. Applied directly, however, SOAP does not address modality competition, which requires additional design.
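
To make the idea of SOAP-style preconditioning concrete, here is a simplified single-matrix sketch in NumPy. It follows the general recipe of running Adam in the eigenbasis of Shampoo's two Kronecker factors, but it is not the reference implementation: bias correction, weight decay, and the infrequent eigenbasis refresh used in practice are omitted, and the names `init_state` and `soap_like_step` are hypothetical.

```python
import numpy as np

def init_state(shape):
    """Zero-initialized optimizer state for one 2-D parameter."""
    m, n = shape
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "M": np.zeros(shape), "v": np.zeros(shape)}

def soap_like_step(G, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Return a simplified SOAP-style update for a 2-D gradient G."""
    # Running averages of Shampoo's left and right Kronecker factors.
    state["L"] = b2 * state["L"] + (1 - b2) * (G @ G.T)
    state["R"] = b2 * state["R"] + (1 - b2) * (G.T @ G)
    # Eigenbases of the factors (recomputed every step here for simplicity).
    _, QL = np.linalg.eigh(state["L"])
    _, QR = np.linalg.eigh(state["R"])
    # Momentum is kept in parameter space and rotated into the eigenbasis.
    state["M"] = b1 * state["M"] + (1 - b1) * G
    M_rot = QL.T @ state["M"] @ QR
    # Adam's second moment is tracked in the rotated space.
    G_rot = QL.T @ G @ QR
    state["v"] = b2 * state["v"] + (1 - b2) * G_rot**2
    step_rot = M_rot / (np.sqrt(state["v"]) + eps)
    # Rotate the step back to parameter space.
    return -lr * (QL @ step_rot @ QR.T)

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))   # toy gradient for a 4x3 weight matrix
state = init_state(G.shape)
update = soap_like_step(G, state)
```

The contrast with AdamW is the rotation: AdamW normalizes each coordinate in the original parameter basis, whereas this sketch normalizes in a basis aligned with the estimated curvature, which is what lets it react to cross-parameter structure.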

## ML-FOP-SOAP Framework: Core Design and Strategies

ML-FOP-SOAP is a second-order optimization framework designed specifically for multimodal models. It has three components:

1. Fisher orthogonal projection (the core innovation): decomposes gradients into modality-shared and modality-specific components to suppress conflicts.
2. Multi-level variance correction: applied at the global level (dynamically adjusting modality weights), the layer level (independent correction per layer), and the head level (correction per attention head).
3. Hierarchical folding strategy: applies corrections incrementally over micro-steps, keeping overhead below 15% and supporting large-batch training.
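
The summary does not give the paper's exact projection rule, so the following is only an illustrative sketch of what a shared/specific decomposition under a Fisher inner product could look like. It assumes a diagonal Fisher estimate (`fisher_diag`, made-up values) and, when the two modalities conflict, drops the conflicting shared component, which amounts to a PCGrad-style projection carried out in the Fisher metric. All function names are hypothetical.

```python
import numpy as np

def fisher_inner(a, b, fisher_diag):
    """Inner product <a, b>_F = sum_i F_i * a_i * b_i (diagonal Fisher)."""
    return float(np.sum(fisher_diag * a * b))

def fisher_decompose(g, g_ref, fisher_diag):
    """Split g into a part along g_ref and a Fisher-orthogonal remainder."""
    coef = (fisher_inner(g, g_ref, fisher_diag)
            / fisher_inner(g_ref, g_ref, fisher_diag))
    shared = coef * g_ref      # component shared with the reference modality
    specific = g - shared      # Fisher-orthogonal to g_ref by construction
    return coef, shared, specific

def resolve_conflict(g_vis, g_txt, fisher_diag):
    """Combine gradients, removing the visual part that opposes text."""
    coef, _, specific = fisher_decompose(g_vis, g_txt, fisher_diag)
    g_vis_fixed = specific if coef < 0 else g_vis
    return g_vis_fixed + g_txt

# Toy, made-up values for illustration only.
g_vis = np.array([4.0, -3.0, 2.0])
g_txt = np.array([-1.0, 1.0, 0.5])
fisher_diag = np.array([1.0, 2.0, 0.5])

coef, shared, specific = fisher_decompose(g_vis, g_txt, fisher_diag)
combined = resolve_conflict(g_vis, g_txt, fisher_diag)
```

In this toy case `coef` is negative, so the visual gradient's shared component opposes the text gradient and is removed; the combined update then no longer works against the text direction in the Fisher metric.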

## Experimental Verification: Performance and Stability Improvements

The method is evaluated on Janus-1.3B and Emu3-8B. Compared with baselines such as AdamW, ML-FOP-SOAP reduces visual FID by 20%, increases text accuracy by 4.3%, and achieves 1.4x sample efficiency and 1.5x training speed. At a batch size of 8192, AdamW diverges while ML-FOP-SOAP converges stably. Ablation experiments support the necessity of all three components: Fisher projection, multi-level correction, and hierarchical folding.

## Technical Contributions and Practical Value

Theoretical contributions: the paper quantifies cross-modal gradient heterogeneity, argues for the advantages of second-order methods, and provides a Fisher-geometric interpretation. Practical value: it reduces training costs (40% higher sample efficiency, 50% faster training, and support for large batches) and improves model quality (both modalities improve, and training is stable). The team plans to open-source the PyTorch implementation, pre-training configurations, and training logs.
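
One way to read the efficiency figures is sketched below. The symbols are assumptions introduced here, not the paper's notation: $N$ is the number of samples needed to reach a fixed quality level, and $T$ is the wall-clock time for a fixed amount of training.

$$
\frac{N_{\text{ML-FOP-SOAP}}}{N_{\text{AdamW}}} = \frac{1}{1.4} \approx 0.71,
\qquad
\frac{T_{\text{ML-FOP-SOAP}}}{T_{\text{AdamW}}} = \frac{1}{1.5} \approx 0.67
$$

Under this reading, the method needs roughly 29% fewer samples and roughly 33% less time than the baseline. The summary does not say whether the two factors are independent, so they are not multiplied here.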

## Limitations and Future Directions

Current limitations: second-order methods carry high computational overhead and large memory requirements, and the approach has only been verified on autoregressive models. Future directions: extending to audio and video modalities, combining with mixed-precision training, making the multi-level correction adaptive, and optimizing communication efficiency in distributed training.