LoongForge: In-Depth Analysis of Baidu's Open-Source Large-Scale Multimodal Model Training Framework

An in-depth analysis of the LoongForge training framework launched by Baidu's Baige AI Infrastructure Platform, covering its unified support for LLM, VLM, VLA, and diffusion models, heterogeneous parallel optimization strategies, and practical experience in enterprise-level large-scale clusters.

Tags: LoongForge · Baidu Baige · Large Model Training · Multimodal Models · VLM · VLA · Diffusion Models · Megatron-LM · Kunlun XPU
Published 2026-04-27 14:59 · Last activity 2026-04-27 15:22 · Estimated read: 7 min

Section 01

[Introduction] LoongForge: Core Analysis of Baidu's Open-Source Large-Scale Multimodal Model Training Framework

LoongForge, launched by Baidu's Baige AI Infrastructure Platform, is an open-source training framework that unifies support for LLM, VLM, VLA, and diffusion models, aiming to address the diverse scenario needs of training models across different modalities. As a core component of the "Loong" open-source series, it features modularity, scalability, and high performance, supporting the full workflow from pre-training to supervised fine-tuning, and has verified its acceleration capability and reliability in enterprise-level clusters.


Section 02

Background and Project Positioning

With the rapid development of LLM, VLM, VLA, and diffusion models, traditional single-purpose training frameworks struggle to meet diverse computing needs. LoongForge is built and enhanced on top of Megatron-LM, with three core design principles: modularity (component-based model decomposition), scalability (heterogeneous hardware support plus flexible parallel strategies), and high performance (system-level optimizations delivering 30%+ acceleration). It is a core component of Baidu's "Loong" open-source series, alongside LoongFlow.


Section 03

Detailed Explanation of Core Technical Features

LoongForge's core technologies include:

  1. Flexible Composable Architecture: Configuration-driven VLM assembly (combining ViT and LLM via YAML configuration), supporting mainstream LLMs (LLaMA, Qwen, etc.), VLMs (Qwen-VL, InternVL, etc.), diffusion models (WAN2.2), and embodied models (Pi0.5).
  2. Heterogeneous Parallelism and Decoupled Training: Configure independent parallel strategies for different components (e.g., visual encoder and language model), decoupling encoder-decoder training to eliminate pipeline bubbles.
  3. Load Balancing and MoE Optimization: Load-aware data redistribution solves data parallel load imbalance; MoE All2All optimization (overlapping communication and computation, activation offloading) reduces memory usage.
  4. Adaptive FP8 Training: End-to-end FP8 support, automatically enabling FP8 based on GEMM shape to balance performance and stability.
  5. Fused Operators and Checkpoint Conversion: Fused operators like FusedDSA accelerate training; supports bidirectional weight conversion between Megatron and HuggingFace, as well as online loading.
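Feature 1's configuration-driven assembly can be illustrated with a small sketch. Everything here (the `build_vlm` function, the config keys) is a hypothetical illustration of the pattern, not LoongForge's actual API:

```python
# Hypothetical sketch of configuration-driven VLM assembly in the style
# LoongForge describes (a ViT encoder and an LLM combined via a YAML-like
# config). All names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class EncoderSpec:
    name: str
    hidden_size: int


@dataclass
class VLMSpec:
    vision: EncoderSpec
    language: EncoderSpec


def build_vlm(config: dict) -> VLMSpec:
    """Assemble a VLM description from a parsed YAML-style dict."""
    return VLMSpec(
        vision=EncoderSpec(**config["vision_encoder"]),
        language=EncoderSpec(**config["language_model"]),
    )


config = {
    "vision_encoder": {"name": "ViT-L/14", "hidden_size": 1024},
    "language_model": {"name": "Qwen2.5-7B", "hidden_size": 3584},
}
vlm = build_vlm(config)
print(vlm.vision.name, vlm.language.name)
```

The point of the pattern is that swapping the visual encoder or the language model is a config change, not a code change.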
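The load-aware redistribution in feature 3 can be sketched as a bin-packing problem: variable-length samples are assigned to data-parallel ranks so that no rank waits on a straggler. The greedy longest-first strategy below is an assumption about the general approach, not LoongForge's implementation:

```python
# Illustrative load-aware sample redistribution across data-parallel ranks:
# greedy longest-first placement onto the currently lightest rank.
import heapq


def redistribute(sample_lengths, num_ranks):
    """Assign sample indices to ranks, roughly minimizing the max token load."""
    heap = [(0, r) for r in range(num_ranks)]  # (current_load, rank_id)
    assignment = {r: [] for r in range(num_ranks)}
    for idx, length in sorted(enumerate(sample_lengths), key=lambda x: -x[1]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + length, rank))
    return assignment


lengths = [512, 128, 2048, 256, 1024, 64]
buckets = redistribute(lengths, 2)
loads = {r: sum(lengths[i] for i in idxs) for r, idxs in buckets.items()}
print(loads)
```

With naive round-robin, one rank could end up with far more tokens than the other; the greedy placement keeps the per-rank loads close, which is exactly the imbalance the feature targets.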
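Feature 4's shape-based FP8 gating presumably decides, per GEMM, whether FP8 kernels are worthwhile. The heuristic below is an assumed illustration: FP8 GEMM kernels commonly require dimension alignment, and very small matrices may not amortize the cast overhead. The specific thresholds are invented for the sketch:

```python
# Hypothetical shape-gated FP8 heuristic: fall back to higher precision when
# a GEMM is too small or misaligned for FP8 kernels. The alignment and size
# thresholds below are illustrative assumptions, not LoongForge's values.
def use_fp8(m: int, n: int, k: int,
            align: int = 16, min_elems: int = 1 << 20) -> bool:
    aligned = all(d % align == 0 for d in (m, n, k))
    big_enough = m * n >= min_elems  # enough work to amortize cast overhead
    return aligned and big_enough


print(use_fp8(4096, 4096, 4096))  # large, aligned GEMM
print(use_fp8(33, 4096, 4096))    # misaligned m dimension
```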

Section 04

Model and Hardware Support Matrix

Model Support:

  • LLM: DeepSeek series (V2, V3, V3.2), LLaMA series (2, 3, 3.1, supporting up to 405B parameters), Qwen series (including MoE variants), MiniMax M2, etc.
  • VLM: Qwen2.5-VL, ERNIE4.5-VL, LLaVA-OneVision-1.5, etc., supporting custom ViT+LLM combinations.
  • Diffusion models: WAN2.2 I2V.
  • Embodied models: Pi0.5.

Hardware Support: Natively supports NVIDIA GPU (optimized for Hopper architecture) and Kunlun XPU (complete guide for P800 platform), enabling a heterogeneous unified platform via plugin design.
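The plugin design mentioned above typically means backends register under a name and the trainer resolves them at runtime. The sketch below shows that pattern in miniature; the names (`register_backend`, `"cuda"`, `"xpu"`) are illustrative, not LoongForge's real extension points:

```python
# Minimal registry-style hardware plugin sketch: each backend registers
# itself under a name, and callers look it up at runtime. Illustrative only.
_BACKENDS = {}


def register_backend(name):
    def wrap(cls):
        _BACKENDS[name] = cls
        return cls
    return wrap


@register_backend("cuda")
class CudaBackend:
    def device(self, rank):
        return f"cuda:{rank}"


@register_backend("xpu")
class XpuBackend:
    def device(self, rank):
        return f"xpu:{rank}"


def get_backend(name):
    return _BACKENDS[name]()


print(get_backend("xpu").device(0))
```

The appeal of this design is that adding a new accelerator means shipping one registered class, with no changes to the training loop itself.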


Section 05

Enterprise Practice and Ecosystem Collaboration

Enterprise Deployment: Before open-sourcing, it already supported large model training in Baidu's internal education, code generation, and other fields, with an average acceleration of over 30%, and it scales seamlessly to ultra-large clusters of 5000+ XPUs.

Ecosystem Collaboration: Collaborates with open-source projects such as Qianfan-VL and LLaVA-OneVision-1.5, and benefits from community contributions from Megatron-LM, Transformers, and others.


Section 06

Quick Start and Future Roadmap

Quick Start: Provides detailed documentation for both GPU and XPU platforms, covering model configuration, quick-start guides for LLM/VLM/VLA pre-training and SFT, and diffusion model training guides. Configuration is managed with Hydra, and example scripts live in the examples directory.

Future Roadmap:

  • Model Expansion: Support models like Kimi 2.6 and DreamZero.
  • Performance Optimization: Improve kernel performance, optimize memory overhead of full heterogeneous DP.
  • Advanced Features: Advanced MoE load balancing, INT4 quantization-aware training, long sequence training optimization, speculative decoding MTP expansion.
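The Hydra configuration management mentioned in the quick start lets users override nested config values from the command line with dotted keys (e.g. `model.hidden_size=4096`). The self-contained stdlib sketch below illustrates that override style without depending on Hydra itself; the config keys are hypothetical:

```python
# Self-contained illustration of the dotted-key override style Hydra
# provides, e.g. `python train.py model.hidden_size=4096 train.lr=1e-4`.
# Stdlib only; the config keys are hypothetical examples.
def apply_overrides(config: dict, overrides: list) -> dict:
    for item in overrides:
        dotted, value = item.split("=", 1)
        node = config
        *path, leaf = dotted.split(".")
        for key in path:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


cfg = {"model": {"name": "llama3", "hidden_size": "8192"}}
apply_overrides(cfg, ["model.hidden_size=4096", "train.lr=1e-4"])
print(cfg)
```

In real Hydra usage the overrides are also type-checked against a schema and composed with config groups; this sketch only shows the nested-key mechanics.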

Section 07

Summary and Outlook

LoongForge marks an important progress in domestic AI training frameworks. As a unified multimodal training platform, it combines technical innovation with enterprise-level reliability. It provides researchers and engineers with a fully functional and high-performance tool, and its support for Kunlun XPU helps build independently controllable AI infrastructure. We look forward to the continuous prosperity of the community and more contributions to the open-source AI ecosystem.