Section 01
Core Guide to Hyperparameter Transfer Research for Large Models
This study examines hyperparameter transfer in large model training. Its core finding is that the advantage of Maximal Update Parameterization (μP) comes mainly from a higher learning rate on the embedding layer, rather than from the more elaborate parameterization theory behind it. Quantitative analysis and ablation experiments show that the embedding layer is a training bottleneck under standard parameterization. This article walks through the background, methods, experiments, and practical recommendations in turn.
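To make the core finding concrete, the following minimal sketch shows one way a higher embedding-layer learning rate could be set up under standard parameterization: parameters are split into an embedding group and a group for everything else, each with its own learning rate. The function name `build_param_groups`, the `embed_lr_mult` multiplier, the name-matching rule, and the numeric values are all assumptions made for illustration and are not taken from the study.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    base_lr: float,
    embed_lr_mult: float,
    embed_key: str = "embed",
) -> List[Dict[str, Any]]:
    """Split parameters into two groups with separate learning rates.

    Hypothetical helper: a parameter counts as "embedding" when its name
    contains `embed_key`. Real models may need a different matching rule.
    """
    embed_params, other_params = [], []
    for name, param in named_params:
        if embed_key in name:
            embed_params.append(param)
        else:
            other_params.append(param)

    # Group 0: embedding parameters, trained with a scaled-up learning rate.
    # Group 1: all remaining parameters, trained with the base learning rate.
    return [
        {"params": embed_params, "lr": base_lr * embed_lr_mult},
        {"params": other_params, "lr": base_lr},
    ]


if __name__ == "__main__":
    # Placeholder strings stand in for real parameter tensors.
    toy_params = [
        ("embed_tokens.weight", "E"),
        ("layers.0.attn.weight", "W_attn"),
        ("layers.0.mlp.weight", "W_mlp"),
    ]
    for group in build_param_groups(toy_params, base_lr=1e-3, embed_lr_mult=10.0):
        print(group["lr"], group["params"])
```

The returned list uses `params` and `lr` keys, which is the per-parameter-group format that PyTorch optimizers accept, so in a PyTorch setting it could be built from `model.named_parameters()` and passed to an optimizer constructor. The multiplier of 10 in the demo is arbitrary; an appropriate value would have to be found experimentally.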