Zing Forum

Quantitative Analysis of Hyperparameter Transfer in Large Model Training: The Critical Role of Embedding Layer Learning Rate

The study finds that the advantages of Maximal Update Parameterization (μP) come mainly from its higher embedding layer learning rate, rather than from the more elaborate parts of the parameterization theory. Systematic ablation experiments show that the embedding layer is a training bottleneck under standard parameterization.

Large language models, Hyperparameter transfer, Learning rate, Embedding layer, Parameterization, AdamW
Published 2026-05-21 01:59 | Recent activity 2026-05-21 11:48 | Estimated read 7 min

Section 01

Core Guide to Hyperparameter Transfer Research for Large Models

This study examines hyperparameter transfer in large model training. Its core finding is that the advantages of Maximal Update Parameterization (μP) come mainly from its higher embedding layer learning rate, rather than from the more elaborate parts of the parameterization theory. Quantitative analysis and ablation experiments show that the embedding layer is a training bottleneck under standard parameterization. This article walks through the background, methods, experiments, and practical recommendations.

Section 02

Background: Demand for Hyperparameter Transfer in Large Model Training

Training a GPT-4-level large model is extremely costly (millions to tens of millions of dollars per run). Hyperparameter transfer has become a key strategy for reducing trial-and-error costs: search for the best hyperparameters on small models, then extrapolate them to large ones. There are two main approaches: 1. Fit scaling laws that predict the optimal values for large models; 2. Use a parameterization (such as μP) that keeps the optimal hyperparameters approximately invariant across scales. However, existing theory does not fully explain why μP works.
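The first approach (fitting a scaling law) can be sketched as a least-squares fit in log-log space. This is a minimal illustration, assuming the optimal learning rate follows a power law in model size; the function names and all numbers below are hypothetical and not taken from the study.

```python
import math

def fit_power_law(sizes, optimal_lrs):
    """Least-squares fit of log(lr) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                          # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)      # intercept = power-law prefactor
    return a, b

def predict_lr(a, b, size):
    """Extrapolate the fitted law lr = a * size**b to a new model size."""
    return a * size ** b

# Synthetic small-scale sweep results (illustrative only): the "best" learning
# rates are generated from an exact power law so the fit can be checked.
small_sizes = [1e7, 3e7, 1e8, 3e8]
best_lrs = [0.5 * s ** -0.3 for s in small_sizes]

a, b = fit_power_law(small_sizes, best_lrs)
lr_large = predict_lr(a, b, 1e10)  # extrapolated value for a much larger model
```

With real sweep data the points will not lie exactly on a line, and the size of the residuals is one way to judge whether extrapolation is trustworthy.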

Section 03

Methods: Quantitative Evaluation Framework for Hyperparameter Transfer Quality

The study establishes a three-dimensional quantitative framework:

1. Scaling law fitting quality: how accurately rules fitted at small scale predict values for large models.
2. Extrapolation error robustness: how strongly small-scale errors affect large model performance.
3. Parameterization asymptotic loss penalty: the performance gap between different parameterizations in the large-scale limit.

Together, these three dimensions give a complete picture of transfer quality.
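The summary above does not give the study's exact metric definitions, so the following is only a sketch with stand-in definitions for each of the three dimensions. The quadratic loss basin in log learning rate, the `curvature` parameter, and every number are assumptions made for illustration.

```python
import math

def log_rms_residual(actual, predicted):
    """Dimension 1 stand-in: RMS error between log(actual) and log(predicted)."""
    diffs = [math.log(a) - math.log(p) for a, p in zip(actual, predicted)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def extrapolation_penalty(lr_used, lr_optimal, curvature):
    """Dimension 2 stand-in: excess loss under an assumed quadratic basin in log(lr)."""
    return curvature * (math.log(lr_used) - math.log(lr_optimal)) ** 2

def asymptotic_penalty(loss_a, loss_b):
    """Dimension 3 stand-in: loss gap between two parameterizations at the largest scale."""
    return loss_a - loss_b

# Hypothetical values, for illustration only.
fit_error = log_rms_residual([1e-3, 5e-4], [1e-3, 5e-4])    # perfect fit gives 0.0
penalty = extrapolation_penalty(2e-3, 1e-3, curvature=0.1)  # learning rate off by 2x
gap = asymptotic_penalty(2.5, 2.0)                          # made-up final losses
```

Under this stand-in, a flatter basin (smaller `curvature`) means the same extrapolation error costs less loss, which is the sense in which a setup is "robust".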

Section 04

Core Finding: The Embedding Layer is a Training Bottleneck in Standard Parameterization

The core advantage of μP over Standard Parameterization (SP) lies in its higher embedding layer learning rate. In SP, the embedding layer has a large number of parameters (proportional to vocabulary size times model dimension), yet its effective gradient updates are limited, which leads to unstable training, slow convergence, and sensitivity to hyperparameters. μP removes this bottleneck by scaling the embedding layer learning rate up by a factor of the model width. Simply raising the embedding layer learning rate in SP to the μP level significantly improves hyperparameter transfer.
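In practice, raising only the embedding layer learning rate amounts to giving the embedding parameters their own optimizer group. Below is a minimal sketch in plain Python. The helper `build_param_groups`, the keyword used to detect embedding parameters, and the parameter names are all hypothetical. Whether the multiplier should be the width itself or a width ratio relative to a base model is not specified in the text, so the plain width is used here as an assumption.

```python
def build_param_groups(param_names, base_lr, width, embedding_keyword="embed"):
    """Split parameter names into two groups; the embedding group gets base_lr * width."""
    embedding = [n for n in param_names if embedding_keyword in n]
    other = [n for n in param_names if embedding_keyword not in n]
    return [
        {"name": "embedding", "params": embedding, "lr": base_lr * width},
        {"name": "other", "params": other, "lr": base_lr},
    ]

# Hypothetical parameter names for a small Transformer.
names = [
    "embed_tokens.weight",
    "layers.0.attn.q_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "lm_head.weight",  # assumption: the output head stays in the default group
]
groups = build_param_groups(names, base_lr=1e-4, width=1024)
```

The groups hold parameter names so the sketch runs without a deep learning framework. Optimizers such as `torch.optim.AdamW` accept a list of dictionaries of this shape when `params` holds the parameter tensors, with a per-group `lr` overriding the default.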

Section 05

Analysis: The Dual Effects of Weight Decay

Weight decay has two opposing effects. On the positive side, it improves the fitting quality of scaling laws, making small-scale extrapolation more reliable. On the negative side, it impairs extrapolation robustness under fixed tokens-per-parameter settings, where small experimental errors are easily amplified. Practical training therefore has to balance fitting quality against robustness.
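As background for why the training setting matters here, the sketch below shows how the number of optimizer steps grows with model size under a fixed tokens-per-parameter budget, and how the shrinkage from decoupled weight decay alone compounds over those steps. It assumes a constant learning rate and ignores gradient updates. The numbers are hypothetical, and this is general arithmetic about AdamW-style decay, not the study's own analysis.

```python
def training_steps(n_params, tokens_per_param, tokens_per_batch):
    """Optimizer steps needed to consume n_params * tokens_per_param tokens."""
    return (n_params * tokens_per_param) // tokens_per_batch

def cumulative_decay(lr, weight_decay, steps):
    """Shrink factor from decoupled weight decay alone: (1 - lr * wd) ** steps."""
    return (1.0 - lr * weight_decay) ** steps

# Hypothetical budget: 20 tokens per parameter, 500k tokens per batch.
steps_small = training_steps(100_000_000, 20, 500_000)
steps_large = training_steps(1_000_000_000, 20, 500_000)

decay_small = cumulative_decay(lr=1e-3, weight_decay=0.1, steps=steps_small)
decay_large = cumulative_decay(lr=1e-3, weight_decay=0.1, steps=steps_large)
```

With the same learning rate and decay coefficient, the larger model trains for ten times as many steps, so the same weight decay setting shrinks weights far more over the full run.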

Section 06

Experimental Verification: Key Experiments Support the Core Hypothesis

A series of ablation experiments supports the hypothesis:

1. Embedding layer learning rate ablation: adjusting only the embedding layer learning rate in SP achieves μP-like effects.
2. Component-level analysis: the embedding layer is identified as the key component.
3. Scale extrapolation test: hyperparameters are extrapolated from small to large scale to verify transfer.

The results strongly support the "embedding layer bottleneck" hypothesis.
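An embedding learning rate ablation of this kind can be laid out as a grid of runs that differ only in whether the embedding layer learning rate is scaled by width. The sketch below is a hypothetical run enumerator; the widths, learning rates, and variant labels are illustrative and are not the study's actual configuration.

```python
from itertools import product

def ablation_grid(widths, base_lrs):
    """Enumerate runs comparing plain SP against SP with a width-scaled embedding LR."""
    runs = []
    for width, base_lr, boost in product(widths, base_lrs, (False, True)):
        runs.append({
            "width": width,
            "base_lr": base_lr,
            "variant": "SP+embed-LR" if boost else "SP",
            # Only the embedding learning rate changes between the two variants.
            "embedding_lr": base_lr * width if boost else base_lr,
        })
    return runs

runs = ablation_grid(widths=[256, 512, 1024], base_lrs=[1e-4, 3e-4, 1e-3])
```

Keeping every other setting identical within each pair of runs is what lets a difference in outcome be attributed to the embedding layer learning rate alone.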

Section 07

Practical Recommendations: Optimization Strategies for Large Model Training

Based on these findings, the study makes four recommendations:

1. Prioritize the embedding layer learning rate when setting hyperparameters.
2. Simplify μP implementation: most of the benefit can be obtained by only raising the embedding layer learning rate in SP.
3. Tune weight decay to the training setting (fixed steps vs. fixed token count).
4. Design small experiments carefully: make sure the embedding layer behaves as it would in the large model, so that extrapolation does not fail.

Section 08

Limitations, Future Directions, and Conclusion

Research limitations: The findings were verified only with the AdamW optimizer and the Transformer architecture; their applicability to other optimizers (such as Adam, SGD) and architectures (such as Mamba, RWKV) remains to be verified.

Future directions: 1. The mathematical principles behind the embedding layer bottleneck; 2. Similar bottlenecks in multimodal models (such as vision-language models); 3. Dynamically adjusting the embedding layer learning rate during training.

Conclusion: By examining the components of μP in detail, this study identifies the source of its advantages and offers concise, effective guidance for large model training. It also shows that useful insight can come from careful attention to details rather than from complex theory.