Section 01
moe-engine: Guide to Sparse MoE Training Infrastructure for 10k-GPU Clusters
Basic Project Information
- Maintainer: Mattral
- Source Code: Composed-Mixture-of-Experts-Engine
Core Positioning
moe-engine is a training runtime for sparse Mixture-of-Experts (MoE) models on very large GPU clusters. It targets clusters of 10,000+ GPUs, where node failures occur continuously, and aims to keep training stable without human intervention.
Key Features
- 4D parallelism: data (DP), expert (EP), tensor (TP), and pipeline (PP) parallelism combined
- Asynchronous, hierarchical checkpointing
- Fault-tolerant recovery built on TorchElastic
- Fused Triton kernels for expert routing
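
To make the 4D parallel strategy more concrete, the sketch below shows one way a flat global rank could be mapped to (DP, EP, TP, PP) coordinates. It is an illustration only, not moe-engine's actual layout: the names `ParallelDims`, `RankCoord`, `rank_to_coord`, and `coord_to_rank` are hypothetical, the axis ordering (TP fastest, PP slowest) is an assumption, the example sizes are invented, and the four axes are treated as fully orthogonal, which a real MoE runtime may not do.

```python
from typing import NamedTuple


class ParallelDims(NamedTuple):
    """Size of each parallel axis (hypothetical; values are illustrative)."""
    dp: int  # data-parallel replicas
    ep: int  # expert-parallel group size
    tp: int  # tensor-parallel group size
    pp: int  # pipeline stages

    @property
    def world_size(self) -> int:
        return self.dp * self.ep * self.tp * self.pp


class RankCoord(NamedTuple):
    """Position of one GPU along each parallel axis."""
    dp: int
    ep: int
    tp: int
    pp: int


def rank_to_coord(rank: int, dims: ParallelDims) -> RankCoord:
    """Map a flat global rank to (dp, ep, tp, pp) coordinates.

    Assumed ordering: tp varies fastest, then ep, then dp, and pp slowest,
    so that tensor-parallel peers have adjacent ranks.
    """
    if not 0 <= rank < dims.world_size:
        raise ValueError(f"rank {rank} outside world size {dims.world_size}")
    rest, tp = divmod(rank, dims.tp)
    rest, ep = divmod(rest, dims.ep)
    pp, dp = divmod(rest, dims.dp)
    return RankCoord(dp=dp, ep=ep, tp=tp, pp=pp)


def coord_to_rank(coord: RankCoord, dims: ParallelDims) -> int:
    """Inverse of rank_to_coord under the same assumed ordering."""
    return ((coord.pp * dims.dp + coord.dp) * dims.ep + coord.ep) * dims.tp + coord.tp


# Illustrative sizes only: 4 * 8 * 8 * 40 = 10240 GPUs.
dims = ParallelDims(dp=4, ep=8, tp=8, pp=40)
first = rank_to_coord(0, dims)
last = rank_to_coord(dims.world_size - 1, dims)
```

Placing TP on the fastest-varying axis keeps tensor-parallel peers on neighbouring ranks, which is a common choice when those ranks share a node; whether moe-engine orders its axes this way is not stated in the source.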