Section 01
[Introduction] HyperP Framework: Hypersphere Optimization Reshapes Large Model Scaling Laws
The Microsoft team proposes HyperP, a hypersphere optimization framework that achieves transferable learning rates across model scales via hypersphere parameterization. At a compute budget of 6e21 FLOPs, it delivers a 1.58x improvement in computational efficiency while maintaining training stability. By combining hypersphere optimization with scaling-law research, the framework addresses a key limitation of existing scaling laws: optimal hyperparameters such as the learning rate do not transfer directly across model scales. It also introduces the SqrtGate mechanism to optimize mixture-of-experts models (see the sketches below), with far-reaching implications for AI infrastructure development.
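The summary does not spell out how the hypersphere parameterization works. One common way such constraints make learning rates transfer across scales is to keep each weight row on the unit hypersphere, so the effective step size no longer depends on the weight norm. The PyTorch sketch below illustrates that idea under this assumption only; the HypersphereLinear class and its project method are hypothetical names for illustration, not the paper's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereLinear(nn.Linear):
    """Linear layer whose weight rows are re-projected onto the unit
    hypersphere after every optimizer step (hypothetical sketch, not
    the paper's actual parameterization)."""

    @torch.no_grad()
    def project(self) -> None:
        # Renormalize each output row to unit length, so the update's
        # effective magnitude is decoupled from the weight norm.
        self.weight.copy_(F.normalize(self.weight, dim=1))

model = nn.Sequential(
    HypersphereLinear(512, 512), nn.GELU(), HypersphereLinear(512, 512)
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-3)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    loss = F.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Re-project after each update so the weights stay on the sphere.
    for m in model.modules():
        if isinstance(m, HypersphereLinear):
            m.project()
    return loss.item()
```

Likewise, SqrtGate is only named in the summary, with no formula given. One illustrative reading, offered purely as a guess, is a top-k router that combines experts using square-rooted and renormalized softmax probabilities, which flattens the routing distribution relative to a plain softmax gate. The sqrt_gate function below is a hypothetical sketch of that reading, not the paper's mechanism.

```python
import torch

def sqrt_gate(logits: torch.Tensor, k: int = 2):
    """Hypothetical top-k MoE gate: combine weights are the square roots
    of the softmax routing probabilities, renormalized to sum to one."""
    probs = logits.softmax(dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    weights = topk_probs.sqrt()
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, topk_idx  # per-token expert weights and indices
```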