LoongForge: In-Depth Analysis of Baidu's Open-Source Large-Scale Multimodal Model Training Framework

An in-depth analysis of the LoongForge training framework launched by Baidu's Baige AI Infrastructure Platform, covering its unified support for LLM, VLM, VLA, and diffusion models, heterogeneous parallel optimization strategies, and practical experience in enterprise-level large-scale clusters.

Tags: LoongForge · Baidu Baige · Large Model Training · Multimodal Models · VLM · VLA · Diffusion Models · Megatron-LM · Kunlun XPU
Published 2026-04-27 14:59 · Last activity 2026-04-27 15:22 · Estimated read: 7 min

Section 01

[Introduction] LoongForge: Core Analysis of Baidu's Open-Source Large-Scale Multimodal Model Training Framework

LoongForge, launched by Baidu's Baige AI Infrastructure Platform, is an open-source training framework that unifies support for LLM, VLM, VLA, and diffusion models, aiming to address the diverse scenario needs of training models across different modalities. As a core component of the "Loong" open-source series, it features modularity, scalability, and high performance, supporting the full workflow from pre-training to supervised fine-tuning, and has verified its acceleration capability and reliability in enterprise-level clusters.


Section 02

Background and Project Positioning

With the rapid development of LLM, VLM, VLA, and diffusion models, traditional single-purpose training frameworks struggle to meet diverse computing needs. LoongForge is built and enhanced on top of Megatron-LM, with three core design principles: modularity (component-based model decomposition), scalability (heterogeneous hardware support plus flexible parallel strategies), and high performance (system-level optimizations delivering 30%+ acceleration). It is a core component of Baidu's "Loong" open-source series, alongside LoongFlow.


Section 03

Detailed Explanation of Core Technical Features

LoongForge's core technologies include:

  1. Flexible Composable Architecture: Configuration-driven VLM assembly (combining ViT and LLM via YAML configuration), supporting mainstream LLMs (LLaMA, Qwen, etc.), VLMs (Qwen-VL, InternVL, etc.), diffusion models (WAN2.2), and embodied models (Pi0.5).
  2. Heterogeneous Parallelism and Decoupled Training: Configure independent parallel strategies for different components (e.g., visual encoder and language model), decoupling encoder-decoder training to eliminate pipeline bubbles.
  3. Load Balancing and MoE Optimization: Load-aware data redistribution solves data parallel load imbalance; MoE All2All optimization (overlapping communication and computation, activation offloading) reduces memory usage.
  4. Adaptive FP8 Training: End-to-end FP8 support, automatically enabling FP8 based on GEMM shape to balance performance and stability.
  5. Fused Operators and Checkpoint Conversion: Fused operators like FusedDSA accelerate training; supports bidirectional weight conversion between Megatron and HuggingFace, as well as online loading.
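Feature 1's configuration-driven assembly can be illustrated with a small sketch. Everything here (the `build_vlm` function, the config keys) is a hypothetical illustration of the pattern, not LoongForge's actual API:

```python
# Hypothetical sketch of configuration-driven VLM assembly in the style
# LoongForge describes (a ViT encoder and an LLM combined via a YAML-like
# config). All names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class EncoderSpec:
    name: str
    hidden_size: int


@dataclass
class VLMSpec:
    vision: EncoderSpec
    language: EncoderSpec


def build_vlm(config: dict) -> VLMSpec:
    """Assemble a VLM description from a parsed YAML-style dict."""
    return VLMSpec(
        vision=EncoderSpec(**config["vision_encoder"]),
        language=EncoderSpec(**config["language_model"]),
    )


config = {
    "vision_encoder": {"name": "ViT-L/14", "hidden_size": 1024},
    "language_model": {"name": "Qwen2.5-7B", "hidden_size": 3584},
}
vlm = build_vlm(config)
print(vlm.vision.name, vlm.language.name)
```

The point of the pattern is that swapping the visual encoder or the language model is a config change, not a code change.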
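The load-aware redistribution in feature 3 can be sketched as a bin-packing problem: variable-length samples are assigned to data-parallel ranks so that no rank waits on a straggler. The greedy longest-first strategy below is an assumption about the general approach, not LoongForge's implementation:

```python
# Illustrative load-aware sample redistribution across data-parallel ranks:
# greedy longest-first placement onto the currently lightest rank.
import heapq


def redistribute(sample_lengths, num_ranks):
    """Assign sample indices to ranks, roughly minimizing the max token load."""
    heap = [(0, r) for r in range(num_ranks)]  # (current_load, rank_id)
    assignment = {r: [] for r in range(num_ranks)}
    for idx, length in sorted(enumerate(sample_lengths), key=lambda x: -x[1]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + length, rank))
    return assignment


lengths = [512, 128, 2048, 256, 1024, 64]
buckets = redistribute(lengths, 2)
loads = {r: sum(lengths[i] for i in idxs) for r, idxs in buckets.items()}
print(loads)
```

With naive round-robin, one rank could end up with far more tokens than the other; the greedy placement keeps the per-rank loads close, which is exactly the imbalance the feature targets.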
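Feature 4's shape-based FP8 gating presumably decides, per GEMM, whether FP8 kernels are worthwhile. The heuristic below is an assumed illustration: FP8 GEMM kernels commonly require dimension alignment, and very small matrices may not amortize the cast overhead. The specific thresholds are invented for the sketch:

```python
# Hypothetical shape-gated FP8 heuristic: fall back to higher precision when
# a GEMM is too small or misaligned for FP8 kernels. The alignment and size
# thresholds below are illustrative assumptions, not LoongForge's values.
def use_fp8(m: int, n: int, k: int,
            align: int = 16, min_elems: int = 1 << 20) -> bool:
    aligned = all(d % align == 0 for d in (m, n, k))
    big_enough = m * n >= min_elems  # enough work to amortize cast overhead
    return aligned and big_enough


print(use_fp8(4096, 4096, 4096))  # large, aligned GEMM
print(use_fp8(33, 4096, 4096))    # misaligned m dimension
```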

Section 04

Model and Hardware Support Matrix

Model Support:

  • LLM: DeepSeek series (V2, V3, V3.2), LLaMA series (2, 3, 3.1, supporting up to 405B parameters), Qwen series (including MoE variants), MiniMax M2, etc.
  • VLM: Qwen2.5-VL, ERNIE4.5-VL, LLaVA-OneVision-1.5, etc., supporting custom ViT+LLM combinations.
  • Diffusion models: WAN2.2 I2V.
  • Embodied models: Pi0.5.

Hardware Support: Natively supports NVIDIA GPU (optimized for Hopper architecture) and Kunlun XPU (complete guide for P800 platform), enabling a heterogeneous unified platform via plugin design.
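The plugin design mentioned above typically means backends register under a name and the trainer resolves them at runtime. The sketch below shows that pattern in miniature; the names (`register_backend`, `"cuda"`, `"xpu"`) are illustrative, not LoongForge's real extension points:

```python
# Minimal registry-style hardware plugin sketch: each backend registers
# itself under a name, and callers look it up at runtime. Illustrative only.
_BACKENDS = {}


def register_backend(name):
    def wrap(cls):
        _BACKENDS[name] = cls
        return cls
    return wrap


@register_backend("cuda")
class CudaBackend:
    def device(self, rank):
        return f"cuda:{rank}"


@register_backend("xpu")
class XpuBackend:
    def device(self, rank):
        return f"xpu:{rank}"


def get_backend(name):
    return _BACKENDS[name]()


print(get_backend("xpu").device(0))
```

The appeal of this design is that adding a new accelerator means shipping one registered class, with no changes to the training loop itself.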


Section 05

Enterprise Practice and Ecosystem Collaboration

Enterprise Deployment: Before open-sourcing, it already supported large model training in Baidu's internal education, code generation, and other fields, with an average acceleration of over 30%, and it scales seamlessly to ultra-large clusters of 5000+ XPUs.

Ecosystem Collaboration: Collaborates with open-source projects such as Qianfan-VL and LLaVA-OneVision-1.5, and benefits from community contributions from Megatron-LM, Transformers, and others.


Section 06

Quick Start and Future Roadmap

Quick Start: Provides detailed documentation for both GPU and XPU platforms, covering model configuration, quick-start guides for LLM/VLM/VLA pre-training and SFT, and diffusion model training guides. Configuration is managed with Hydra, and example scripts live in the examples directory.

Future Roadmap:

  • Model Expansion: Support models like Kimi 2.6 and DreamZero.
  • Performance Optimization: Improve kernel performance, optimize memory overhead of full heterogeneous DP.
  • Advanced Features: Advanced MoE load balancing, INT4 quantization-aware training, long sequence training optimization, speculative decoding MTP expansion.
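The Hydra configuration management mentioned in the quick start lets users override nested config values from the command line with dotted keys (e.g. `model.hidden_size=4096`). The self-contained stdlib sketch below illustrates that override style without depending on Hydra itself; the config keys are hypothetical:

```python
# Self-contained illustration of the dotted-key override style Hydra
# provides, e.g. `python train.py model.hidden_size=4096 train.lr=1e-4`.
# Stdlib only; the config keys are hypothetical examples.
def apply_overrides(config: dict, overrides: list) -> dict:
    for item in overrides:
        dotted, value = item.split("=", 1)
        node = config
        *path, leaf = dotted.split(".")
        for key in path:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


cfg = {"model": {"name": "llama3", "hidden_size": "8192"}}
apply_overrides(cfg, ["model.hidden_size=4096", "train.lr=1e-4"])
print(cfg)
```

In real Hydra usage the overrides are also type-checked against a schema and composed with config groups; this sketch only shows the nested-key mechanics.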

Section 07

Summary and Outlook

LoongForge marks an important progress in domestic AI training frameworks. As a unified multimodal training platform, it combines technical innovation with enterprise-level reliability. It provides researchers and engineers with a fully functional and high-performance tool, and its support for Kunlun XPU helps build independently controllable AI infrastructure. We look forward to the continuous prosperity of the community and more contributions to the open-source AI ecosystem.