XTuner V1: Next-Generation Training Engine for Ultra-Large-Scale MoE Models

XTuner V1 is a next-generation LLM training engine designed specifically for ultra-large-scale Mixture-of-Experts (MoE) models. It moves beyond the limits of the traditional 3D parallel architecture, supports training models with up to 1 trillion parameters, and, on Ascend NPUs, reaches training efficiency exceeding that of the NVIDIA H800.

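Tags: XTuner, MoE, Mixture-of-Experts models, large-model training, expert parallelism, Ascend NPU, long-sequence training, open-source framework, Shanghai AI Laboratory
Published 2026-03-30 11:14 | Recent activity 2026-03-30 11:20 | Estimated read 7 min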

Section 01

XTuner V1: Introduction to the Next-Generation Training Engine for Ultra-Large-Scale MoE Models

XTuner V1 is a next-generation LLM training engine developed by Shanghai AI Laboratory and designed specifically for ultra-large-scale Mixture-of-Experts (MoE) models. It moves beyond the limits of the traditional 3D parallel architecture, supports training models with up to 1 trillion parameters, and, on Ascend NPUs, reaches training efficiency exceeding that of the NVIDIA H800. Its core advantages are simplified parallel strategies, long-sequence training support, cross-hardware compatibility, and end-to-end algorithm support. The goal is to lower the barrier to research on ultra-large-scale MoE models and to help build out the domestic compute ecosystem.

Section 02

Background: Technical Challenges in MoE Model Training

Mixture-of-Experts (MoE) models use sparse activation to grow total parameter count much faster than per-token compute, because each token is processed by only a few experts. Training them, however, raises challenges such as the complexity of expert parallelism, load imbalance across experts, and memory bottlenecks in long-sequence training. Traditional 3D parallel strategies (data, tensor, and pipeline parallelism, usually combined with expert parallelism for MoE) hit scalability bottlenecks for MoE models with over 200 billion parameters, so simplifying the parallel strategy while maintaining efficiency has become an industry focus.
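
To make the sparse-activation point concrete, here is a rough parameter count for a single MoE feed-forward layer. The layer sizes and the two-matrix expert shape are illustrative assumptions, not figures from XTuner V1 or any model it supports.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) parameter counts for one MoE FFN layer.

    Assumes each expert is a plain two-matrix FFN (up and down projection,
    no biases) plus a linear router; real models differ in detail.
    """
    per_expert = 2 * d_model * d_ff          # up-projection + down-projection
    router = d_model * num_experts           # one routing logit per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router     # only top_k experts run per token
    return total, active


# Hypothetical sizes chosen only for illustration.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, num_experts=64, top_k=8)
print(f"total={total:,} active={active:,} ratio={active / total:.3f}")
```

Under these assumptions the parameters that must be stored grow with the number of experts, while the parameters touched per token grow only with top_k, which is why memory and communication, rather than compute, tend to dominate MoE training design.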

Section 03

Core Architectural Innovations of XTuner V1

Dropless Training: Breaking Through Expert Parallelism Limitations

  • No expert parallelism required for 200-billion-parameter models, reducing system complexity
  • Only intra-node expert parallelism needed for 600-billion-parameter models, cutting cross-node communication overhead
  • Optimized load balancing to ensure training stability
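
The general idea behind dropless routing can be sketched in a few lines. This toy comparison is not XTuner V1's implementation, which the article does not describe; the function names and the skewed router output are hypothetical.

```python
from collections import Counter


def route_with_capacity(assignments, capacity):
    """Capacity-based routing: tokens beyond an expert's capacity are dropped."""
    load = Counter()
    kept, dropped = [], []
    for token_id, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token_id, expert))
        else:
            dropped.append(token_id)
    return kept, dropped


def route_dropless(assignments, num_experts):
    """Dropless routing: every token is kept, so group sizes vary per expert."""
    groups = {e: [] for e in range(num_experts)}
    for token_id, expert in enumerate(assignments):
        groups[expert].append(token_id)
    return groups


# Hypothetical top-1 router output for 8 tokens over 4 experts (skewed load).
assignments = [0, 0, 0, 0, 0, 1, 2, 0]
kept, dropped = route_with_capacity(assignments, capacity=2)
groups = route_dropless(assignments, num_experts=4)
print("dropped with capacity=2:", dropped)
print("dropless group sizes:", {e: len(t) for e, t in groups.items()})
```

The trade-off is visible in the group sizes: dropless routing loses no tokens, but it has to handle uneven per-expert batches, which is where load balancing becomes important.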

Long-Sequence Training Support

  • Memory optimizations allow training 200-billion-parameter MoE models at 64K sequence length without sequence parallelism
  • Supports DeepSpeed Ulysses sequence parallelism, so the maximum sequence length scales linearly
  • Tuned to handle expert load fluctuations in long sequences and keep training stable
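
As a rough illustration of how Ulysses-style sequence parallelism works, the sketch below simulates its all-to-all step on plain Python lists: each rank starts with a slice of the sequence for all attention heads and ends with the full sequence for a slice of the heads. The function name and toy sizes are assumptions for illustration; this is not DeepSpeed's or XTuner's API.

```python
def ulysses_all_to_all(shards, num_heads):
    """Simulate the all-to-all that turns sequence shards into head shards.

    shards[r] is rank r's local data: a list of tokens, each token a list with
    one entry per attention head. Returns out[r]: the full sequence, but only
    for the heads owned by rank r.
    """
    world = len(shards)
    assert num_heads % world == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world
    out = []
    for r in range(world):
        lo, hi = r * heads_per_rank, (r + 1) * heads_per_rank
        # Concatenate every rank's tokens in order, keeping only heads [lo, hi).
        full_seq = [tok[lo:hi] for shard in shards for tok in shard]
        out.append(full_seq)
    return out


# Toy setup (illustrative values): 8 tokens, 4 heads, 2 ranks.
seq_len, num_heads, world = 8, 4, 2
tokens = [[f"t{t}h{h}" for h in range(num_heads)] for t in range(seq_len)]
per_rank = seq_len // world
shards = [tokens[r * per_rank:(r + 1) * per_rank] for r in range(world)]

after = ulysses_all_to_all(shards, num_heads)
print(len(shards[0]), "tokens x", len(shards[0][0]), "heads per rank before")
print(len(after[0]), "tokens x", len(after[0][0]), "heads per rank after")
```

Outside attention each rank holds only seq_len / world tokens, so adding ranks lets the total sequence length grow proportionally, as long as the head count divides evenly across ranks.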

Section 04

Performance: Redefining Training Efficiency Standards

Scale Support Capability

  • Supports training of MoE models up to 1 trillion parameters
  • For models with over 200 billion parameters, FSDP training throughput exceeds traditional 3D parallelism for the first time
  • After optimization on Ascend A3 super nodes, efficiency surpasses NVIDIA H800
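
For a sense of scale, the back-of-the-envelope estimate below shows how fully sharded training spreads model state across devices. The 16 bytes per parameter figure is the commonly cited cost of mixed-precision Adam and is an assumption here; the article does not state XTuner V1's actual precision or optimizer setup, and the cluster sizes are hypothetical.

```python
def fsdp_state_bytes_per_device(num_params: float, num_devices: int,
                                bytes_per_param: int = 16) -> float:
    """Per-device bytes for fully sharded model states.

    The default of 16 bytes/param assumes mixed-precision Adam: 2 (16-bit
    weights) + 2 (16-bit gradients) + 12 (FP32 master weights, momentum,
    variance). Activations and temporary buffers are not counted.
    """
    return num_params * bytes_per_param / num_devices


for devices in (256, 1024, 4096):   # hypothetical cluster sizes
    gib = fsdp_state_bytes_per_device(1e12, devices) / 2**30
    print(f"{devices:>5} devices: {gib:,.1f} GiB of model state each")
```

The estimate covers only weights, gradients, and optimizer state; activation memory, which dominates at long sequence lengths, comes on top of it.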

Multi-Hardware Platform Support

Models, listed against GPU (FP8), GPU (BF16), and NPU (BF16):

  • Intern S1
  • Intern VL
  • Qwen3 Dense
  • Qwen3 MoE
  • GPT OSS (🚧 in progress)
  • Deepseek V3 (🚧 in progress)
  • KIMI K2 (🚧 in progress)

Section 05

Algorithm Capabilities: End-to-End Support from Pre-Training to Reinforcement Learning

Implemented Features

  • Multimodal pre-training: End-to-end support for vision-language model training
  • Multimodal Supervised Fine-Tuning (SFT): Optimized for instruction-following tasks
  • GRPO: Group Relative Policy Optimization for reinforcement learning training
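
The "group relative" part of GRPO can be illustrated with its advantage computation: several responses are sampled for the same prompt, and each reward is normalized against the group's own statistics instead of a learned value function. This is a minimal sketch of that one step, not XTuner V1's code; the reward values are made up, and using the population standard deviation with a small epsilon is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 sampled responses to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest a negative one; these values then weight the policy-gradient update.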

Coming Soon

  • MPO: Mixed Preference Optimization algorithm
  • DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
  • Multi-turn Agentic RL: Advanced reinforcement learning capabilities for agents

Section 06

Ecosystem Integration and Open-Source Contributions

As a general-purpose training backend for the open-source ecosystem, XTuner V1 integrates with mainstream inference frameworks: LMDeploy (deployment and inference), vLLM (high-throughput serving), and SGLang (structured generation). In the spirit of open collaboration, it also draws on training engines such as TorchTitan, DeepSpeed, MindSpeed, and Megatron, as well as reinforcement learning frameworks such as veRL, SLIME, AReaL, and OpenRLHF.

Section 07

Practical Significance and Future Outlook

The release of XTuner V1 matters in three ways:

  1. Lowering the barrier to research: Simplified parallel strategies let more teams take part in ultra-large-scale MoE research
  2. Optimizing for domestic compute: In-depth optimization for Ascend NPUs supports the domestic AI chip ecosystem
  3. End-to-end support: Covers every stage from pre-training to reinforcement learning for industry, academia, and research

With MoE architectures now widely used in models such as Deepseek V3, Qwen3 MoE, and KIMI K2, XTuner V1 is expected to become a key piece of infrastructure for ultra-large-scale model training.