Modality Competition in Multimodal Models: A Multi-Level Variance Correction Method Based on Second-Order Optimization

This paper proposes ML-FOP-SOAP, an optimization framework that uses Fisher orthogonal projection to suppress the modality conflicts caused by cross-modal gradient heterogeneity. In experiments on Janus and Emu3, the method trains stably at a batch size of 8192, improves sample efficiency by 1.4x, and speeds up training by 1.5x.

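ML-FOP-SOAP, second-order optimization, multimodal models, modality competition, SOAP, Fisher orthogonal projection, large-scale training, unified multimodal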

Published 2026-05-16 00:45 | Recent activity 2026-05-18 16:23 | Estimated read 6 min

Section 01

[Introduction] The ML-FOP-SOAP Framework Solves Modality Competition in Multimodal Models

This paper addresses modality competition in the training of unified multimodal models and proposes the ML-FOP-SOAP optimization framework, which uses Fisher orthogonal projection to suppress conflicts caused by cross-modal gradient heterogeneity. The framework is validated on the Janus and Emu3 models: it trains stably at a batch size of 8192, improves sample efficiency by 1.4x, speeds up training by 1.5x, and breaks the performance trade-off between the visual and text modalities.

Section 02

Research Background: Optimization Challenges of Unified Multimodal Models

Autoregressive next-token prediction provides a single training framework for both image generation and text understanding, and models such as Janus and Emu3 have shown its potential. The same setup also introduces modality competition: visual and text gradient updates conflict during training. The symptoms are loss oscillations, opposing gradient directions, hyperparameter sensitivity, and collapse under large-batch training, all of which limit training at scale.

Section 03

Root Cause of the Problem: Limitations of First-Order Optimizers

The root cause of modality competition is that first-order optimizers such as AdamW are vulnerable to cross-modal gradient heterogeneity. This heterogeneity shows up in three ways: magnitude (visual gradients are large because of high-dimensional outputs, while text gradients are small), direction (the two gradients often point in nearly opposite directions, with an angle close to 180 degrees), and curvature (the Hessian spectra differ). AdamW uses only gradient information, rescales each parameter independently with running moment estimates of that gradient, and is sensitive to noise, so it cannot handle this heterogeneity effectively.
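The magnitude and direction symptoms described above can be measured directly from two flattened gradient vectors. The following is a minimal sketch, not code from the paper: the function name `heterogeneity` and the toy gradient values are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def heterogeneity(g_visual, g_text):
    """Return (magnitude ratio, angle in degrees) between two flattened gradients."""
    ratio = norm(g_visual) / norm(g_text)
    cos = dot(g_visual, g_text) / (norm(g_visual) * norm(g_text))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return ratio, math.degrees(math.acos(cos))

# Toy values invented for illustration; real gradients come from backprop
g_visual = [8.0, -6.0, 4.0]
g_text = [-0.4, 0.3, -0.1]

ratio, angle = heterogeneity(g_visual, g_text)
print(f"magnitude ratio: {ratio:.1f}, angle: {angle:.1f} degrees")
```

A ratio far above 1 together with an angle well past 90 degrees is the pattern the text describes: a large visual gradient pulling against a small text gradient.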

Section 04

Method Foundation: Advantages of Second-Order Preconditioning SOAP

Second-order preconditioning, as in SOAP, provides a stable foundation for multimodal alignment. SOAP combines Shampoo preconditioning, low-rank approximation, and adaptive momentum, and it performs well in single-modality training. Unlike first-order methods, second-order methods can account for curvature differences, correct update directions, and remain robust to differences in gradient magnitude. Applied directly, however, SOAP does not resolve modality competition; that still requires a design aimed at it.
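Why curvature awareness helps can be seen on a toy problem. The sketch below is not SOAP, which uses matrix preconditioners; it assumes a two-parameter quadratic loss with known diagonal curvature and compares plain gradient descent with a step that divides each gradient coordinate by its curvature. All names and values are illustrative.

```python
def grad(w, curvature):
    # Gradient of f(w) = 0.5 * sum(c_i * w_i^2)
    return [c * x for c, x in zip(curvature, w)]

def plain_step(w, curvature, lr):
    return [x - lr * g for x, g in zip(w, grad(w, curvature))]

def preconditioned_step(w, curvature, lr):
    # Divide each gradient coordinate by its curvature
    # (the exact inverse Hessian for this diagonal quadratic)
    return [x - lr * g / c
            for x, g, c in zip(w, grad(w, curvature), curvature)]

curvature = [100.0, 1.0]  # one steep axis, one flat axis (assumed values)
w_plain = [1.0, 1.0]
w_pre = [1.0, 1.0]

for _ in range(20):
    # lr for the plain step must stay below 2/100 or the steep axis diverges
    w_plain = plain_step(w_plain, curvature, lr=0.019)
    w_pre = preconditioned_step(w_pre, curvature, lr=0.5)

print("plain:", w_plain)
print("preconditioned:", w_pre)
```

The plain step is forced to use a learning rate small enough for the steep axis, so the flat axis barely moves; the preconditioned step shrinks both coordinates at the same rate regardless of the 100x curvature gap.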

Section 05

ML-FOP-SOAP Framework: Core Design and Strategies

ML-FOP-SOAP is a second-order optimization framework designed specifically for multimodal models. It has three parts:
1. Fisher orthogonal projection (the core innovation): decomposes gradients into modality-shared and modality-specific components to suppress conflicts.
2. Multi-level variance correction: applied at the global level (dynamically adjusting modality weights), the layer level (independent correction per layer), and the head level (correction per attention head).
3. Hierarchical folding strategy: applies the correction incrementally across micro-steps, keeps the added overhead below 15%, and supports large-batch training.
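The summary does not give the projection formula, so the following is only one plausible reading of "Fisher orthogonal projection", offered as a sketch. It assumes a diagonal Fisher estimate, uses it as the weight in an inner product, and splits the text gradient into a part along the visual gradient and a Fisher-orthogonal remainder. The function names, the conflict rule in `resolve_conflict`, and the numbers are all hypothetical.

```python
def fisher_dot(u, v, fisher_diag):
    """Inner product weighted by a diagonal Fisher estimate (assumed form)."""
    return sum(f * a * b for f, a, b in zip(fisher_diag, u, v))

def split_gradient(g, reference, fisher_diag):
    """Split g into a part along `reference` and a Fisher-orthogonal remainder."""
    coef = (fisher_dot(g, reference, fisher_diag)
            / fisher_dot(reference, reference, fisher_diag))
    along = [coef * r for r in reference]
    remainder = [a - s for a, s in zip(g, along)]
    return coef, along, remainder

def resolve_conflict(g_text, g_visual, fisher_diag):
    """Drop the part of g_text that opposes g_visual; otherwise keep g_text."""
    coef, _, remainder = split_gradient(g_text, g_visual, fisher_diag)
    return remainder if coef < 0 else g_text

# Toy values invented for illustration
fisher_diag = [2.0, 1.0, 0.5]
g_visual = [1.0, 0.0, 2.0]
g_text = [-1.0, 1.0, 0.5]

coef, along, remainder = split_gradient(g_text, g_visual, fisher_diag)
g_text_fixed = resolve_conflict(g_text, g_visual, fisher_diag)
print("coefficient along visual gradient:", coef)
print("text gradient after projection:", g_text_fixed)
```

In this reading, `along` plays the role of the shared-direction component and `remainder` the modality-specific one. A negative coefficient signals a conflict, and the returned gradient then has zero Fisher-weighted inner product with the visual gradient.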

Section 06

Experimental Verification: Performance and Stability Improvements

The method is evaluated on Janus-1.3B and Emu3-8B. Compared with baselines such as AdamW, ML-FOP-SOAP reduces visual FID by 20%, increases text accuracy by 4.3%, and achieves 1.4x sample efficiency and 1.5x training speed. At a batch size of 8192, AdamW diverges while ML-FOP-SOAP converges stably. Ablation experiments show that Fisher projection, multi-level correction, and hierarchical folding are each necessary.

Section 07

Technical Contributions and Practical Value

Theoretical contributions: the paper quantifies cross-modal gradient heterogeneity, demonstrates the advantages of second-order methods, and provides a Fisher geometric interpretation. Practical value: lower training costs (40% higher sample efficiency, 50% faster training, and support for large batches) and better model quality (both modalities improve and training is stable). The team will open-source the PyTorch implementation, pre-training configurations, and training logs.
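The efficiency multipliers translate into cost savings less directly than the percentages suggest. The arithmetic below is a sketch that assumes "1.4x sample efficiency" means reaching the same quality with 1/1.4 of the samples, and "1.5x training speed" means finishing in 1/1.5 of the time; the summary does not state either definition.

```python
sample_efficiency = 1.4  # reported multiplier
speedup = 1.5            # reported multiplier

# Fraction saved relative to the baseline, under the assumptions above
samples_saved = 1 - 1 / sample_efficiency
time_saved = 1 - 1 / speedup

print(f"{samples_saved:.1%} fewer samples, {time_saved:.1%} less training time")
```

Under these assumptions the savings are roughly 29% fewer samples and 33% less training time, not 40% and 50%.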

Section 08

Limitations and Future Directions

Current limitations: second-order methods carry high computational overhead and large memory requirements, and the approach has only been validated on autoregressive models. Future directions: extend to audio and video modalities, combine with mixed-precision training, make the multi-level correction adaptive, and optimize communication efficiency in distributed training.
