Zing Forum

The Spectral Filtering Nature of Momentum in the Muon Optimizer: Denoise First, Orthogonalize Later

The study clarifies the theoretical role of momentum in the Muon optimizer. Under a gradient model of structured signal plus perturbation, momentum acts as a spectral filter: it suppresses the perturbations and preserves the dominant signal, which gives the orthogonalization step a more stable singular subspace to work with.

Tags: Muon optimizer, momentum, spectral filtering, orthogonalization, large language model training, optimization theory
Published 2026-06-03 00:54 | Recent activity 2026-06-03 13:22 | Estimated read 7 min

Section 01

Introduction to The Spectral Filtering Nature of Momentum in the Muon Optimizer: Denoise First, Orthogonalize Later

Core Insights

The study clarifies the theoretical role of momentum in the Muon optimizer. Under a gradient model of structured signal plus perturbation, momentum acts as a spectral filter that suppresses the perturbations and preserves the dominant signal. This widens the spectral gap and stabilizes the singular subspace seen by the orthogonalization step. The order of operations also matters: momentum must be computed first and orthogonalization applied afterwards. The paper reports experiments that support the theory.

Original Article Information

  • Original Authors: arXiv Author Team
  • Source: arXiv (published on June 2, 2026)
  • Original Title: Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering
  • Original Link: http://arxiv.org/abs/2606.03899v1

Section 02

The Rise of the Muon Optimizer and Theoretical Gaps

The Muon optimizer has recently shown strong empirical performance in large language model training and has attracted wide attention. A key theoretical question remains open, however: what exactly does momentum do in Muon? Existing analyses either remove momentum and study the spectral update in isolation, or keep momentum but do not explain why it improves performance. This gap limits our understanding of how Muon works and makes it harder to improve or extend the optimizer.

Section 03

Core Finding: Momentum as a Spectral Filter

The paper fills this gap with a theoretical analysis showing that momentum in Muon acts as a spectral filter. The gradient is modeled as a structured signal plus a random perturbation. Because momentum is a running average over time, signal components that persist across steps accumulate, while perturbations whose directions vary from step to step partly cancel. In the spectral domain, the singular values associated with the dominant signal are preserved relative to those associated with noise, which are suppressed.
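The averaging argument can be written out for a simplified case. The symbols below are assumptions made for illustration and are not taken from the paper: a time-constant signal matrix $S$, zero-mean noise $N_t$ with i.i.d. entries of variance $\sigma^2$ that is independent across steps, and an exponential-moving-average form of momentum with coefficient $\beta$ (practical Muon implementations may scale the buffer differently or use a Nesterov variant).

```latex
\begin{aligned}
G_t &= S + N_t, \qquad M_0 = 0, \qquad M_t = \beta M_{t-1} + (1-\beta)\, G_t \\
M_t &= (1-\beta^{t})\, S \;+\; (1-\beta) \sum_{k=1}^{t} \beta^{\,t-k} N_k \\
\operatorname{Var}\big[(M_t - \mathbb{E} M_t)_{ij}\big]
    &= (1-\beta)^2 \sum_{k=1}^{t} \beta^{\,2(t-k)} \sigma^2
    \;\xrightarrow[t \to \infty]{}\; \frac{1-\beta}{1+\beta}\, \sigma^2
\end{aligned}
```

Under these assumptions the signal term converges to $S$ itself, while the per-entry noise standard deviation shrinks by a factor of $\sqrt{(1-\beta)/(1+\beta)}$, about 0.23 for $\beta = 0.9$. The signal is not enlarged in absolute terms; it grows relative to the noise.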

This filtering widens the spectral gap between signal and perturbation, which matters for orthogonalization because that step depends on the stability of the input matrix's singular subspace. When the gap is small, a perturbation can change the singular vectors substantially. A wider gap stabilizes the singular subspace of the matrix passed to the orthogonalization step, so the resulting updates are more reliable and consistent.
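The gap-widening effect can be checked in a toy simulation. The sketch below is not the paper's experiment: the matrix sizes, the rank-2 signal with singular values 3 and 2, the Gaussian noise level, and the EMA form of momentum are all assumptions chosen for illustration. It compares the ratio of the second to the third singular value for a single noisy gradient and for the momentum buffer.

```python
import numpy as np

def gap_ratio(mat, rank):
    """Ratio sigma_rank / sigma_(rank+1); larger means a clearer signal/noise split."""
    s = np.linalg.svd(mat, compute_uv=False)
    return float(s[rank - 1] / s[rank])

def spectral_gaps(beta=0.9, steps=200, m=64, n=32, noise_std=0.3, seed=0):
    """Return (gap of the last raw gradient, gap of the momentum buffer)."""
    rng = np.random.default_rng(seed)
    # Assumed signal: a fixed rank-2 matrix with singular values 3 and 2.
    u, _ = np.linalg.qr(rng.standard_normal((m, 2)))
    v, _ = np.linalg.qr(rng.standard_normal((n, 2)))
    signal = u @ np.diag([3.0, 2.0]) @ v.T
    momentum = np.zeros((m, n))
    for _ in range(steps):
        # Assumed perturbation: fresh i.i.d. Gaussian noise at every step.
        grad = signal + noise_std * rng.standard_normal((m, n))
        # EMA-style momentum (an assumption; Muon variants may scale differently).
        momentum = beta * momentum + (1 - beta) * grad
    return gap_ratio(grad, 2), gap_ratio(momentum, 2)

raw_gap, momentum_gap = spectral_gaps()
print(f"raw gradient gap: {raw_gap:.2f}, momentum buffer gap: {momentum_gap:.2f}")
```

With these settings the noise in a single gradient is strong enough to nearly bury the second signal direction, whereas in the momentum buffer the two signal directions should stand clearly above the noise.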

Section 04

Importance of Order: Denoise First, Orthogonalize Later

The paper proves that the order of operations matters: applying momentum before orthogonalization gives a stronger guarantee of alignment with the signal component of the gradient than either reversing the order or removing momentum entirely.

This explains why Muon's 'momentum first, orthogonalization later' design works well: the matrix fed into orthogonalization has already been denoised, so its dominant structure reflects the true optimization direction rather than noise.
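The three orderings can be compared in a small simulation. This is a sketch under assumptions, not the paper's setup: the signal and noise model are the same toy choices as before, orthogonalization is done with an exact SVD polar factor as a stand-in for the Newton-Schulz iteration used in practice, and the alignment score (inner product with the signal divided by the signal's nuclear norm) is a metric chosen here for illustration.

```python
import numpy as np

def orthogonalize(mat):
    """Exact polar factor U @ Vt via SVD (stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ vt

def alignment(update, signal):
    """<update, signal> / ||signal||_*; at most 1 if the update has spectral norm <= 1."""
    return float(np.sum(update * signal) / np.linalg.norm(signal, ord="nuc"))

def compare_orders(beta=0.9, steps=200, m=64, n=32, noise_std=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed signal: fixed rank-2 matrix; assumed noise: i.i.d. Gaussian per step.
    u, _ = np.linalg.qr(rng.standard_normal((m, 2)))
    v, _ = np.linalg.qr(rng.standard_normal((n, 2)))
    signal = u @ np.diag([3.0, 2.0]) @ v.T
    mom_of_grads = np.zeros((m, n))   # buffer for "momentum, then orthogonalize"
    mom_of_orths = np.zeros((m, n))   # buffer for "orthogonalize, then momentum"
    scores = {"momentum_then_orth": [], "orth_then_momentum": [], "no_momentum": []}
    for t in range(steps):
        grad = signal + noise_std * rng.standard_normal((m, n))
        orth_grad = orthogonalize(grad)
        mom_of_grads = beta * mom_of_grads + (1 - beta) * grad
        mom_of_orths = beta * mom_of_orths + (1 - beta) * orth_grad
        if t >= steps // 2:  # skip the warm-up phase of the buffers
            scores["momentum_then_orth"].append(
                alignment(orthogonalize(mom_of_grads), signal))
            scores["orth_then_momentum"].append(alignment(mom_of_orths, signal))
            scores["no_momentum"].append(alignment(orth_grad, signal))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

results = compare_orders()
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

In this toy setting, orthogonalizing each raw gradient flattens signal and noise directions to the same magnitude before any averaging happens, so averaging afterwards cannot recover the lost contrast. Averaging first lets the signal directions dominate before the singular values are equalized.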

Section 05

Experimental Verification and Hyperparameter Tuning

The theoretical analysis has been verified on diverse tasks (including large language model pre-training), and the experimental results are highly consistent with theoretical predictions, supporting the interpretation of momentum as a spectral filter.

Based on this understanding, researchers can adjust Muon's hyperparameters (such as momentum coefficient, orthogonalization frequency) in a targeted manner to achieve better performance on specific tasks.

Section 06

Broader Significance of the Research

The significance of this work goes beyond Muon itself: it provides a theoretical starting point for understanding the role of momentum in matrix-based optimizers. Many modern optimizers (such as Shampoo, SOAP) involve matrix operations and momentum accumulation, and this analytical framework can be extended to these scenarios to help understand their effectiveness and improvement directions.

Section 07

Implications for Practice

For engineers and researchers who train large models, the work suggests the following:

  1. Momentum plays an essential role in matrix-based optimizers and should not be casually removed or simplified.
  2. Tuning the momentum coefficient has a theoretical interpretation: in effect, it adjusts the cutoff frequency of the spectral filter.
  3. The 'denoise first, orthogonalize later' principle may carry over to the design of other multi-step optimization algorithms, where the interactions between the individual techniques deserve attention.
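To make the second point concrete, the sketch below treats momentum as a first-order low-pass filter over training steps. This temporal view is an analogy assumed here, not the paper's spectral-domain analysis, and it assumes the EMA form y_t = beta * y_(t-1) + (1 - beta) * x_t. The function names are hypothetical helpers. For each coefficient it reports the steady-state noise reduction factor, the -3 dB cutoff in radians per step, and the rough averaging horizon 1 / (1 - beta).

```python
import math

def ema_noise_gain(beta):
    """Steady-state std of i.i.d. noise after the EMA, relative to the raw std."""
    return math.sqrt((1 - beta) / (1 + beta))

def ema_gain(beta, omega):
    """Magnitude response of the EMA filter at angular frequency omega."""
    return (1 - beta) / math.sqrt(1 - 2 * beta * math.cos(omega) + beta ** 2)

def ema_cutoff(beta):
    """-3 dB cutoff (radians per step); valid for beta >= 3 - 2*sqrt(2)."""
    return math.acos(1 - (1 - beta) ** 2 / (2 * beta))

for beta in (0.8, 0.9, 0.95, 0.99):
    print(
        f"beta={beta:<5} noise gain={ema_noise_gain(beta):.3f} "
        f"cutoff={ema_cutoff(beta):.4f} rad/step "
        f"horizon~{1 / (1 - beta):.0f} steps"
    )
```

A larger coefficient lowers the cutoff and suppresses more noise, but it also lengthens the averaging horizon, so the buffer reacts more slowly when the signal direction itself changes.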