# LoongForge: In-Depth Analysis of Baidu's Open-Source Large-Scale Multimodal Model Training Framework

> An in-depth analysis of the LoongForge training framework launched by Baidu's Baige AI Infrastructure Platform, covering its unified support for LLM, VLM, VLA, and diffusion models, heterogeneous parallel optimization strategies, and practical experience in enterprise-level large-scale clusters.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-27T06:59:45.000Z
- Last activity: 2026-04-27T07:22:30.369Z
- Popularity: 162.6
- Keywords: LoongForge, Baidu, Baige, large model training, multimodal models, VLM, VLA, diffusion models, Megatron-LM, Kunlun XPU, heterogeneous parallelism, MoE optimization, FP8 training, AI infrastructure
- Page link: https://www.zingnex.cn/en/forum/thread/loongforge
- Canonical: https://www.zingnex.cn/forum/thread/loongforge
- Markdown source: floors_fallback

---

## [Introduction] LoongForge: Core Analysis of Baidu's Open-Source Large-Scale Multimodal Model Training Framework

LoongForge, launched by Baidu's Baige AI Infrastructure Platform, is an open-source training framework that unifies support for LLM, VLM, VLA, and diffusion models, addressing the diverse requirements of training models across modalities. As a core component of the "Loong" open-source series, it emphasizes modularity, scalability, and high performance, supports the full workflow from pre-training to supervised fine-tuning, and has had its acceleration and reliability validated on enterprise-scale clusters.

## Background and Project Positioning

With the rapid development of LLMs, VLMs, VLAs, and diffusion models, traditional single-purpose training frameworks struggle to meet their diverse computing needs. LoongForge is built on and extends Megatron-LM, with three core design principles: modularity (component-based model decomposition), scalability (heterogeneous hardware support plus flexible parallel strategies), and high performance (system-level optimizations delivering 30%+ acceleration). It is a core component of Baidu's "Loong" open-source series, alongside LoongFlow.

## Detailed Explanation of Core Technical Features

LoongForge's core technologies include:
1. **Flexible Composable Architecture**: Configuration-driven VLM assembly (combining a ViT and an LLM via YAML configuration), supporting mainstream LLMs (LLaMA, Qwen, etc.), VLMs (Qwen-VL, InternVL, etc.), diffusion models (WAN2.2), and embodied models (Pi0.5); a minimal assembly sketch appears after this list.
2. **Heterogeneous Parallelism and Decoupled Training**: Independent parallel strategies can be configured per component (e.g., the visual encoder and the language model), and encoder-decoder training is decoupled to eliminate pipeline bubbles.
3. **Load Balancing and MoE Optimization**: Load-aware data redistribution resolves data-parallel load imbalance (a greedy-binning sketch appears after this list); MoE All2All optimization (overlapping communication and computation, activation offloading) reduces memory usage.
4. **Adaptive FP8 Training**: End-to-end FP8 support, automatically enabling FP8 based on GEMM shape to balance performance and stability; a shape-gating sketch appears after this list.
5. **Fused Operators and Checkpoint Conversion**: Fused operators such as FusedDSA accelerate training; bidirectional weight conversion between Megatron and HuggingFace formats is supported, as well as online loading.
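
On point 1, LoongForge's actual YAML schema lives in its repository; the following is a minimal, hypothetical sketch of what configuration-driven ViT+LLM assembly can look like. All config keys, class names, and model names here (`vision_encoder`, `ComposedVLM`, etc.) are illustrative assumptions, not LoongForge's real API.

```python
# Minimal, hypothetical sketch of configuration-driven VLM assembly.
# Config keys, class names, and model names are illustrative only;
# LoongForge's real YAML schema and builders live in its repository.
from dataclasses import dataclass

import yaml  # PyYAML

CONFIG = """
vision_encoder:
  name: vit-large
  tensor_parallel_size: 1   # each component can carry its own parallel layout
language_model:
  name: qwen2.5-7b
  tensor_parallel_size: 4
projector:
  type: mlp                 # adapter mapping vision features into the LLM
"""

@dataclass
class Component:
    name: str
    tensor_parallel_size: int = 1

@dataclass
class ComposedVLM:
    vision_encoder: Component
    language_model: Component
    projector_type: str

def build_vlm(cfg: dict) -> ComposedVLM:
    """Assemble a VLM from independently configured components."""
    return ComposedVLM(
        vision_encoder=Component(**cfg["vision_encoder"]),
        language_model=Component(**cfg["language_model"]),
        projector_type=cfg["projector"]["type"],
    )

if __name__ == "__main__":
    print(build_vlm(yaml.safe_load(CONFIG)))
```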
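
On point 3, this summary does not spell out the redistribution algorithm. A standard way to flatten data-parallel load is greedy longest-processing-time (LPT) binning by per-sample token count, sketched below as an assumption rather than as LoongForge's actual strategy.

```python
# Hedged sketch of load-aware sample redistribution across data-parallel
# ranks: greedy longest-processing-time (LPT) binning by token count.
# This is one standard balancing scheme, not LoongForge's documented one.
import heapq

def redistribute(sample_lengths: list[int], num_ranks: int) -> list[list[int]]:
    """Assign sample indices to ranks so per-rank token counts stay even."""
    heap = [(0, r) for r in range(num_ranks)]  # (current_load, rank)
    buckets: list[list[int]] = [[] for _ in range(num_ranks)]
    # Place the longest samples first, always onto the least-loaded rank.
    order = sorted(range(len(sample_lengths)),
                   key=lambda i: sample_lengths[i], reverse=True)
    for i in order:
        load, rank = heapq.heappop(heap)
        buckets[rank].append(i)
        heapq.heappush(heap, (load + sample_lengths[i], rank))
    return buckets

# Example: mixed-length multimodal samples spread evenly over 2 ranks,
# ending with 1500 tokens per rank instead of a skewed split.
print(redistribute([900, 100, 800, 200, 500, 500], 2))
```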
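
On point 4, one plausible reading of "enable FP8 based on GEMM shape" is a dispatcher that requires the GEMM dimensions to be both large enough to amortize quantization overhead and aligned to FP8 kernel granularity. The threshold and alignment values below are illustrative guesses, not LoongForge's actual policy.

```python
# Illustrative sketch of shape-gated FP8 dispatch; the threshold and the
# alignment rule are assumptions, not LoongForge's actual heuristic.
def should_use_fp8(m: int, n: int, k: int,
                   min_dim: int = 512, align: int = 16) -> bool:
    """Decide whether a GEMM of shape (m, k) x (k, n) runs in FP8.

    Small or misaligned GEMMs stay in higher precision: the cast/scale
    overhead of FP8 outweighs its throughput gain, and FP8 tensor-core
    kernels typically require dimensions aligned to a fixed granularity.
    """
    aligned = all(d % align == 0 for d in (m, n, k))
    large_enough = min(m, n, k) >= min_dim
    return aligned and large_enough

# Example: a large transformer projection qualifies, a small misaligned
# GEMM does not.
assert should_use_fp8(4096, 4096, 8192) is True
assert should_use_fp8(128, 1000, 64) is False
```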

## Model and Hardware Support Matrix

**Model Support**:
- LLM: DeepSeek series (V2, V3, V3.2), LLaMA series (2, 3, 3.1, supporting up to 405B parameters), Qwen series (including MoE variants), MiniMax M2, etc.
- VLM: Qwen2.5-VL, ERNIE4.5-VL, LLaVA-OneVision-1.5, etc., supporting custom ViT+LLM combinations.
- Diffusion models: WAN2.2 I2V.
- Embodied models: Pi0.5.

**Hardware Support**: Natively supports NVIDIA GPUs (optimized for the Hopper architecture) and Kunlun XPUs (with a complete guide for the P800 platform), unifying heterogeneous hardware behind a plugin design; an illustrative registry sketch follows.
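
The plugin interface itself is not detailed in this summary. As a hedged sketch of how a plugin design can hide vendor differences behind a single registry keyed by device type (the backend names and strings below are hypothetical):

```python
# Hypothetical sketch of a device-backend plugin registry; the interface
# and backend names are illustrative, not LoongForge's actual plugin API.
from typing import Callable, Dict

_BACKENDS: Dict[str, Callable[[], str]] = {}

def register_backend(name: str):
    """Decorator registering a hardware backend under a device-type key."""
    def wrap(fn: Callable[[], str]) -> Callable[[], str]:
        _BACKENDS[name] = fn
        return fn
    return wrap

@register_backend("cuda")
def cuda_backend() -> str:
    return "NVIDIA GPU path (e.g., Hopper-optimized kernels)"

@register_backend("xpu")
def xpu_backend() -> str:
    return "Kunlun XPU path (e.g., the P800 platform)"

def select_backend(device: str) -> str:
    """Training code asks for a backend by device type, not by vendor API."""
    return _BACKENDS[device]()

print(select_backend("xpu"))
```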

## Enterprise Practice and Ecosystem Collaboration

**Enterprise Deployment**: Before being open-sourced, LoongForge already powered large-model training for Baidu's internal education, code-generation, and other workloads, delivering an average speedup of over 30%, and it scales seamlessly to ultra-large clusters of 5000+ XPUs.

**Ecosystem Collaboration**: It collaborates with open-source projects such as Qianfan-VL and LLaVA-OneVision-1.5, and benefits from community contributions from Megatron-LM, Transformers, and others.

## Quick Start and Future Roadmap

**Quick Start**: Detailed documentation is provided for both GPU and XPU platforms, covering model configuration and quick-start guides for LLM/VLM/VLA pre-training and SFT, plus diffusion-model training guides. Configuration is managed with Hydra, and example scripts live in the examples directory; a minimal Hydra entrypoint is sketched below.
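
As a hedged illustration of Hydra-style configuration management (the config path, config name, and override keys below are assumptions, not LoongForge's actual layout), a minimal training entrypoint might look like:

```python
# Minimal sketch of a Hydra-managed training entrypoint; config path and
# field names are hypothetical. Assumes conf/train.yaml exists beside
# this script.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="conf", config_name="train")
def main(cfg: DictConfig) -> None:
    # Hydra composes conf/train.yaml with any command-line overrides,
    # e.g.: python train.py model=qwen2.5-vl trainer.micro_batch_size=2
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```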
**Future Roadmap**:
- Model Expansion: Support models like Kimi 2.6 and DreamZero.
- Performance Optimization: Improve kernel performance, optimize memory overhead of full heterogeneous DP.
- Advanced Features: Advanced MoE load balancing, INT4 quantization-aware training, long sequence training optimization, speculative decoding MTP expansion.

## Summary and Outlook

LoongForge marks important progress for domestically developed AI training frameworks. As a unified multimodal training platform, it combines technical innovation with enterprise-grade reliability, giving researchers and engineers a full-featured, high-performance tool, and its Kunlun XPU support helps build independently controllable AI infrastructure. We look forward to a thriving community and further contributions to the open-source AI ecosystem.
