# The Spectral Filtering Nature of Momentum in the Muon Optimizer: Denoise First, Orthogonalize Later

> The study clarifies the theoretical role of momentum in the Muon optimizer. Under a model that decomposes each gradient into a structured signal plus a perturbation, momentum acts as a spectral filter: it suppresses the perturbation, preserves the dominant signal, and gives the orthogonalization step a more stable singular subspace to work with.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-02T16:54:38.000Z
- Last activity: 2026-06-03T05:22:59.614Z
- Popularity: 134.5
- Keywords: Muon optimizer, momentum, spectral filtering, orthogonalization, large language model training, optimization theory
- Page link: https://www.zingnex.cn/en/forum/thread/muon
- Canonical: https://www.zingnex.cn/forum/thread/muon

---

## Introduction

### Core Insights
Momentum in Muon acts as a spectral filter. Under a gradient model of structured signal plus perturbation, it suppresses the perturbation and preserves the dominant signal, which widens the spectral gap and stabilizes the singular subspace seen by the orthogonalization step. The order of operations is also critical: momentum must be computed first and orthogonalization applied second. The paper reports experiments that support the theory.

### Original Article Information
- Original Authors: arXiv Author Team
- Source: arXiv (published on June 2, 2026)
- Original Title: Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering
- Original Link: http://arxiv.org/abs/2606.03899v1

## The Rise of the Muon Optimizer and Theoretical Gaps

The Muon optimizer has recently shown strong empirical performance in large language model training and has attracted widespread attention. A key theoretical question remains open, however: what role does momentum play in Muon? Existing analyses either remove momentum to study the spectral update in isolation, or keep momentum without explaining why it improves performance. This gap limits understanding of how Muon works and makes the optimizer harder to improve or extend.

## Core Finding: Momentum as a Spectral Filter

The paper closes this gap with a theoretical analysis showing that momentum in Muon acts as a spectral filter. When each gradient is modeled as a structured signal plus a random disturbance, averaging over time reinforces the signal components that persist across steps and cancels disturbances whose directions are inconsistent. In the spectral domain, the singular values of the dominant signal are strengthened relative to those of the noise, which are suppressed.
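The filtering effect can be illustrated with a small NumPy simulation. This is a toy sketch, not the paper's experimental setup: the fixed low-rank signal, the i.i.d. Gaussian noise, the exponential-moving-average form of momentum, and all names and default values (`make_signal`, `gap_ratio`, `compare_gaps`, `beta=0.95`) are assumptions made for illustration.

```python
import numpy as np

def make_signal(m, n, svals, rng):
    """Build a fixed low-rank signal matrix with the given singular values."""
    rank = len(svals)
    u, _ = np.linalg.qr(rng.standard_normal((m, rank)))
    v, _ = np.linalg.qr(rng.standard_normal((n, rank)))
    return u @ np.diag(svals) @ v.T

def gap_ratio(mat, rank):
    """Ratio sigma_rank / sigma_(rank+1); larger means the signal
    singular values stand further above the noise."""
    s = np.linalg.svd(mat, compute_uv=False)
    return s[rank - 1] / s[rank]

def compare_gaps(beta=0.95, steps=300, m=32, n=32, svals=(12.0, 8.0),
                 noise_std=1.0, seed=0):
    """Return (gap ratio of the last raw gradient, gap ratio of the momentum)."""
    rng = np.random.default_rng(seed)
    signal = make_signal(m, n, svals, rng)  # assumed constant across steps
    momentum = np.zeros((m, n))
    for _ in range(steps):
        # Toy gradient model: structured signal plus fresh random noise.
        grad = signal + noise_std * rng.standard_normal((m, n))
        # Exponential moving average as the momentum buffer.
        momentum = beta * momentum + (1.0 - beta) * grad
    rank = len(svals)
    return gap_ratio(grad, rank), gap_ratio(momentum, rank)

raw_gap, filtered_gap = compare_gaps()
print(f"gap ratio, single raw gradient: {raw_gap:.2f}")
print(f"gap ratio, momentum buffer:     {filtered_gap:.2f}")
```

Under these assumptions the momentum buffer shows a clearly larger gap ratio than any single noisy gradient, because the persistent signal survives the averaging while the noise is averaged down.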

This filtering widens the spectral gap between signal and disturbance, which matters for orthogonalization because that step depends on the stability of the input matrix's singular subspace. When the gap is small, even a mild disturbance can change the singular vectors substantially. A wider gap stabilizes the singular subspace of the matrix passed to the orthogonalization step, making the resulting updates more reliable and consistent.
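The role of the spectral gap can be seen in a standard Wedin-type perturbation bound. The statement below is a textbook-style form with constants omitted, given as an illustration and not as the specific theorem proved in the paper. The notation is introduced here: $M$ is the clean signal matrix, $E$ the disturbance, $U_r$ and $\widehat{U}_r$ the top-$r$ left singular subspaces of $M$ and $\widehat{M}$, and $\Theta$ the matrix of principal angles between them.

$$
\widehat{M} = M + E, \qquad
\bigl\|\sin\Theta(\widehat{U}_r, U_r)\bigr\| \;\lesssim\; \frac{\|E\|}{\sigma_r(M) - \sigma_{r+1}(M)}
\quad \text{when } \|E\| \text{ is small relative to the gap.}
$$

Read this way, momentum helps on both sides of the fraction: averaging shrinks $\|E\|$ while leaving the signal singular values roughly intact, so the ratio of disturbance to gap, and with it the possible rotation of the singular subspace, becomes smaller.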

## Importance of Order: Denoise First, Orthogonalize Later

The paper proves that the order of operations matters: applying momentum before orthogonalization gives a stronger guarantee of alignment with the signal component of the gradient than either reversing the order or removing momentum entirely.

This explains why Muon's 'momentum first, orthogonalization later' design works well: the matrix fed into orthogonalization has already been denoised, so its dominant structure reflects the true optimization direction rather than noise.
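The three orderings can be compared in a toy experiment. The sketch below rests on several assumptions that are not taken from the paper: orthogonalization is done with an exact SVD-based polar factor instead of the iterative scheme used in practical Muon implementations, the signal is a fixed low-rank matrix with i.i.d. Gaussian noise, and the `alignment` score is a hypothetical metric chosen for illustration.

```python
import numpy as np

def orthogonalize(mat):
    """Exact polar factor U V^T via SVD (a stand-in for an iterative scheme)."""
    u, _, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ vt

def alignment(update, u_sig, v_sig):
    """Hypothetical metric: mean of u_k^T @ update @ v_k over signal directions.
    It equals 1 when the update reproduces the signal's polar factor exactly."""
    return np.trace(u_sig.T @ update @ v_sig) / u_sig.shape[1]

def compare_orders(beta=0.95, steps=300, n=32, svals=(6.0, 4.0),
                   noise_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    rank = len(svals)
    u_sig, _ = np.linalg.qr(rng.standard_normal((n, rank)))
    v_sig, _ = np.linalg.qr(rng.standard_normal((n, rank)))
    signal = u_sig @ np.diag(svals) @ v_sig.T
    ema_grad = np.zeros((n, n))  # momentum of raw gradients
    ema_orth = np.zeros((n, n))  # momentum of already-orthogonalized gradients
    for _ in range(steps):
        grad = signal + noise_std * rng.standard_normal((n, n))
        ema_grad = beta * ema_grad + (1.0 - beta) * grad
        ema_orth = beta * ema_orth + (1.0 - beta) * orthogonalize(grad)
    return {
        # Denoise first, orthogonalize later.
        "momentum_then_orth": alignment(orthogonalize(ema_grad), u_sig, v_sig),
        # Reversed order: orthogonalize each noisy gradient, then average.
        "orth_then_momentum": alignment(ema_orth, u_sig, v_sig),
        # No momentum: orthogonalize the last noisy gradient only.
        "no_momentum": alignment(orthogonalize(grad), u_sig, v_sig),
    }

for name, score in compare_orders().items():
    print(f"{name:>20s}: {score:.3f}")
```

In this setting the momentum-first variant should score highest, since only in that case does orthogonalization see a matrix whose leading singular directions have already been cleaned up.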

## Experimental Verification and Hyperparameter Tuning

The paper validates the analysis on a range of tasks, including large language model pre-training. The experimental results closely match the theoretical predictions, supporting the interpretation of momentum as a spectral filter.

With this understanding, practitioners can tune Muon's hyperparameters, such as the momentum coefficient and the orthogonalization frequency, in a targeted way for a given task.

## Broader Significance of the Research

The significance of this work goes beyond Muon itself: it provides a theoretical starting point for understanding the role of momentum in matrix-based optimizers. Many modern optimizers, such as Shampoo and SOAP, combine matrix operations with momentum accumulation, and this analytical framework can be extended to them to help explain why they work and how they might be improved.

## Implications for Practice

For engineers and researchers who train large models, the work suggests the following:
1. Momentum plays an essential role in matrix-based optimizers and should not be casually removed or simplified.
2. Tuning the momentum coefficient has a theoretical basis: it is, in effect, setting the cutoff frequency of the spectral filter.
3. The principle of 'denoise first, orthogonalize later' may carry over to the design of other multi-step optimization algorithms, where the interaction between steps deserves attention.
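The cutoff-frequency reading in point 2 can be made concrete with the standard frequency response of an exponential moving average. This is a general signal-processing identity, not a result quoted from the paper. It assumes momentum is written as an EMA with coefficient $\beta$ and that the noise in each gradient entry is zero-mean, independent across steps, with variance $\sigma^2$; the symbols $m_t$, $g_t$, $H$, and $\omega$ are notation introduced here.

$$
m_t = \beta\, m_{t-1} + (1-\beta)\, g_t, \qquad
H(\omega) = \frac{1-\beta}{1 - \beta e^{-i\omega}}, \qquad
|H(0)| = 1,
$$

$$
\operatorname{Var}\bigl[m_t^{\text{noise}}\bigr] \;\longrightarrow\; \frac{1-\beta}{1+\beta}\,\sigma^2
\quad \text{as } t \to \infty .
$$

A component that stays constant across steps ($\omega = 0$) passes through unchanged, while step-to-step noise is attenuated. With $\beta = 0.95$, for example, the noise variance falls to about $0.05 / 1.95 \approx 0.026$ of its original value, and the effective averaging window is roughly $1/(1-\beta) = 20$ steps. Raising $\beta$ lowers the cutoff and removes more noise, at the cost of reacting more slowly when the signal itself changes.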
