
XTuner V1: Next-Generation Training Engine for Ultra-Large-Scale MoE Models

XTuner V1 is a next-generation LLM training engine designed for ultra-large-scale Mixture-of-Experts (MoE) models. It moves beyond the limitations of traditional 3D parallel architectures, supports training models with up to 1 trillion parameters, and, on Ascend NPUs, achieves training efficiency exceeding that of NVIDIA H800 GPUs.

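Tags: XTuner, MoE, Mixture-of-Experts models, large model training, expert parallelism, Ascend NPU, long-sequence training, open-source framework, Shanghai AI Laboratory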

Published 2026-03-30 11:14 | Recent activity 2026-03-30 11:20 | Estimated read 7 min

Section 01

XTuner V1: Introduction to the Next-Generation Training Engine for Ultra-Large-Scale MoE Models

XTuner V1 is a next-generation LLM training engine developed by Shanghai AI Laboratory and designed for ultra-large-scale Mixture-of-Experts (MoE) models. It moves beyond the limitations of traditional 3D parallel architectures, supports training models with up to 1 trillion parameters, and, on Ascend NPUs, achieves training efficiency exceeding that of NVIDIA H800 GPUs. Its core advantages are simplified parallel strategies, long-sequence training support, compatibility across hardware platforms, and end-to-end algorithm support. The stated goals are to lower the barrier to research on ultra-large-scale MoE models and to strengthen China's domestic computing ecosystem.

Section 02

Background: Technical Challenges in MoE Model Training

Mixture-of-Experts (MoE) models use sparse activation to grow total parameter count far faster than per-token compute: each token is routed to only a few experts. Training them, however, raises several challenges: the complexity of expert parallelism, load imbalance across experts, and memory bottlenecks in long-sequence training. Traditional 3D parallel strategies (data, tensor, and pipeline parallelism), typically combined with expert parallelism, hit scalability bottlenecks for MoE models with more than 200 billion parameters. Simplifying the parallel strategy without losing efficiency has therefore become a focus for the industry.
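
The sparse activation and load-imbalance issues described above can be illustrated with a minimal top-k routing sketch. This is not XTuner V1's router: the function names (route_top_k, expert_load), the logits, and the choice of k = 2 are hypothetical, picked only to show that each token activates k of the experts and that the resulting per-expert load can be uneven.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_load(batch_logits, k, num_experts):
    """Count how many tokens each expert receives in one batch."""
    load = [0] * num_experts
    for logits in batch_logits:
        for expert, _weight in route_top_k(logits, k):
            load[expert] += 1
    return load


# Hypothetical router logits: 4 tokens, 4 experts, each token picks 2 experts.
batch_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 2.5, 0.0, 0.2],
    [3.0, 0.5, 0.4, 0.1],
    [0.1, 0.2, 2.0, 1.9],
]
load = expert_load(batch_logits, k=2, num_experts=4)
print(load)  # [3, 3, 1, 1]: experts 0 and 1 are three times busier than 2 and 3
```

Only half of the experts run for any given token here, which is the source of the compute savings, while the skewed load counts show why balancing is a training concern.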

Section 03

Core Architectural Innovations of XTuner V1

Dropless Training: Breaking Through Expert Parallelism Limitations

No expert parallelism required for 200-billion-parameter models, reducing system complexity
Only intra-node expert parallelism needed for 600-billion-parameter models, cutting cross-node communication overhead
Optimized load balancing to ensure training stability
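
The article does not define "dropless", so the sketch below assumes the term's usual MoE meaning: no token is discarded when an expert is over-subscribed. It contrasts a capacity-limited dispatch, which drops overflow tokens, with a dropless dispatch, where every token reaches its expert and bucket sizes simply vary. The function names, the top-1 assignments, and the capacity value are hypothetical and are not taken from XTuner V1.

```python
def dispatch_with_capacity(assignments, num_experts, capacity):
    """Capacity-based dispatch: tokens beyond an expert's capacity are dropped."""
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token)
        else:
            dropped.append(token)
    return buckets, dropped


def dispatch_dropless(assignments, num_experts):
    """Dropless dispatch: every token is kept, so bucket sizes are uneven."""
    buckets = [[] for _ in range(num_experts)]
    for token, expert in enumerate(assignments):
        buckets[expert].append(token)
    return buckets


# Hypothetical top-1 expert choice for 8 tokens; expert 0 is over-subscribed.
assignments = [0, 0, 0, 1, 0, 2, 0, 1]

capped, dropped = dispatch_with_capacity(assignments, num_experts=4, capacity=2)
dropless = dispatch_dropless(assignments, num_experts=4)

print(dropped)                      # [2, 4, 6]: three tokens never reach expert 0
print([len(b) for b in dropless])   # [5, 2, 1, 0]: all 8 tokens are processed
```

The trade-off this illustrates is that dropless dispatch keeps all training signal but produces variable-sized expert batches, which is why load balancing matters for stability.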

Long-Sequence Training Support

Memory optimizations: trains 200-billion-parameter MoE models at 64K sequence length without sequence parallelism
Supports DeepSpeed Ulysses sequence parallelism, so the maximum sequence length scales linearly with the sequence-parallel size
Remains stable under the expert load imbalance that arises in long-sequence training
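
As a rough illustration of why Ulysses-style sequence parallelism extends the maximum sequence length, the sketch below tracks per-device tensor shapes only. It assumes the general DeepSpeed Ulysses scheme, in which activations are sharded along the sequence dimension and an all-to-all re-shards them along the attention-head dimension for the attention computation. The function names, head count, and head size are hypothetical, and the linear scaling is an idealization that holds the per-device token budget fixed.

```python
def ulysses_shapes(seq_len, num_heads, head_dim, sp_degree):
    """Per-device (tokens, heads, head_dim) outside and inside attention."""
    if seq_len % sp_degree or num_heads % sp_degree:
        raise ValueError("seq_len and num_heads must be divisible by sp_degree")
    # Outside attention: each device holds a slice of the sequence, all heads.
    sequence_sharded = (seq_len // sp_degree, num_heads, head_dim)
    # Inside attention (after the all-to-all): full sequence, a slice of heads.
    head_sharded = (seq_len, num_heads // sp_degree, head_dim)
    return sequence_sharded, head_sharded


def max_seq_len(per_device_tokens, sp_degree):
    """Idealized maximum sequence length for a fixed per-device token budget."""
    return per_device_tokens * sp_degree


# 64K tokens per device is the figure quoted above; the rest is hypothetical.
outside, inside = ulysses_shapes(seq_len=65536, num_heads=32, head_dim=128, sp_degree=4)
print(outside)                 # (16384, 32, 128)
print(inside)                  # (65536, 8, 128)
print(max_seq_len(65536, 4))   # 262144
```

Both shapes hold the same number of elements per device, which is what lets the per-device footprint stay flat while the total sequence grows with the sequence-parallel size.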

Section 04

Performance: Redefining Training Efficiency Standards

Scale Support Capability

Supports training of MoE models up to 1 trillion parameters
For models with over 200 billion parameters, FSDP training throughput exceeds traditional 3D parallelism for the first time
Optimized for Ascend A3 supernodes, where training efficiency surpasses that of NVIDIA H800
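
To give a sense of scale for the parameter counts above, here is a back-of-the-envelope estimate of sharded model-state memory under FSDP, which splits parameters, gradients, and optimizer state across devices. The 16 bytes per parameter figure is the common mixed-precision Adam accounting (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights and two Adam moments); whether XTuner V1 uses this layout is not stated in the article, and the device counts are hypothetical. Activations, buffers, and fragmentation are ignored.

```python
def fsdp_state_gib_per_device(num_params, num_devices, bytes_per_param=16):
    """Sharded model-state memory per device in GiB (activations excluded).

    bytes_per_param=16 is an assumed mixed-precision Adam layout, not a
    figure confirmed for XTuner V1.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_devices / 2**30


# Hypothetical cluster sizes for the two model scales mentioned in the article.
for label, params, devices in [
    ("200B parameters on 256 devices", 200e9, 256),
    ("1T parameters on 1024 devices", 1e12, 1024),
]:
    gib = fsdp_state_gib_per_device(params, devices)
    print(f"{label}: about {gib:.1f} GiB of model state per device")
```

Under these assumptions the two cases come to roughly 11.6 GiB and 14.6 GiB per device, which suggests why full sharding alone can be viable at these scales when enough devices are available.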

Multi-Hardware Platform Support

Model	GPU (FP8)	GPU (BF16)	NPU (BF16)
Intern S1	✅	✅	✅
Intern VL	✅	✅	✅
Qwen3 Dense	✅	✅	✅
Qwen3 MoE	✅	✅	✅
GPT OSS	✅	✅	🚧
DeepSeek V3	✅	✅	🚧
Kimi K2	✅	✅	🚧

Section 05

Algorithm Capabilities: End-to-End Support from Pre-Training to Reinforcement Learning

Implemented Features

Multimodal pre-training: End-to-end support for vision-language model training
Multimodal Supervised Fine-Tuning (SFT): Optimized for instruction-following tasks
GRPO: Group Relative Policy Optimization for reinforcement learning training
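
For readers unfamiliar with GRPO, its defining step is scoring each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below assumes the commonly described formulation (reward minus the group mean, divided by the group standard deviation). The function name, the epsilon, and the use of the population standard deviation are illustrative choices, not details confirmed for XTuner V1.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    eps keeps the division finite when every response gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for 4 sampled responses to one prompt (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]

# When all responses score the same, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

Responses that beat their group average receive a positive advantage and the rest a negative one, so the policy update needs only rewards and no separate critic model.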

Coming Soon

MPO: Mixed Preference Optimization algorithm
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization
Multi-turn Agentic RL: Advanced reinforcement learning capabilities for agents

Section 06

Ecosystem Integration and Open-Source Contributions

As a general-purpose training backend for the open-source ecosystem, XTuner V1 integrates with mainstream inference frameworks: LMDeploy (deployment and inference), vLLM (high-throughput serving), and SGLang (structured generation). Its design draws on training engines such as TorchTitan, DeepSpeed, MindSpeed, and Megatron, and on reinforcement learning frameworks such as veRL, SLIME, AReaL, and OpenRLHF, reflecting the project's emphasis on open collaboration.

Section 07

Practical Significance and Future Outlook

Significance of XTuner V1's release:

Lowering the barrier to research: Simplified parallel strategies allow more teams to take part in ultra-large-scale MoE research
Optimizing for domestic hardware: In-depth optimization for Ascend NPUs supports China's domestic AI chip ecosystem
End-to-end support: Covers the needs of industry and academia at every stage, from pre-training to reinforcement learning

With MoE architectures now widely adopted, including in openly documented models such as DeepSeek V3, Qwen3 MoE, and Kimi K2, XTuner V1 is expected to become key infrastructure for ultra-large-scale model training.
