Practical Guide to Continued Pre-training of Large Models: Production-Grade Pipeline Based on PyTorch FSDP

A production-oriented continued pre-training framework for large language models, supporting PyTorch FSDP distributed training, validated on Qwen2.5-0.5B, and providing a complete workflow from data conversion to model deployment.

Large language models, LLM, continued pre-training, PyTorch FSDP, distributed training, domain adaptation, Qwen, model fine-tuning

Published 2026-06-14 05:35 | Recent activity 2026-06-14 05:50 | Estimated read 5 min

Section 01

Introduction to Practical Continued Pre-training of Large Models: Production-Grade Pipeline Based on PyTorch FSDP

This project is a production-oriented continued pre-training framework for large language models. It supports PyTorch FSDP distributed training, has been validated on Qwen2.5-0.5B, and covers the complete workflow from data conversion to model deployment. The project is maintained by josephGoke and was released on June 13, 2026; the source code is available on GitHub (https://github.com/josephGoke/llm-continued-pretraining).

Section 02

Why Do We Need Continued Pre-training?

General-purpose pre-trained models lack domain-specific expertise, pre-training from scratch is costly, and simple fine-tuning struggles to inject large amounts of new knowledge. Continued pre-training is a middle path: it further trains an existing model on a domain-specific corpus, so the model retains its general capabilities while absorbing domain knowledge. This has made it a mainstream approach for building domain-specific large models.

Section 03

Analysis of Core Features of the Project

PyTorch FSDP distributed training: Uses FULL_SHARD sharding strategy, CPU offloading, backward prefetching, etc., to reduce single-GPU memory usage;
Enterprise-level data pipeline: Supports conversion of multiple formats (txt, CSV, PDF, JSON) to JSONL;
Flexible configuration system: Manages models, training hyperparameters, and optimization techniques (e.g., gradient checkpointing, BF16 mixed precision) via YAML;
Comprehensive monitoring: Log saving, Weights & Biases tracking, regular validation, and resuming training from checkpoints.
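To make the configuration-driven features above more concrete, here is a minimal sketch of how the FSDP-related options might look once a YAML file has been parsed into a dictionary. The key names (sharding_strategy, cpu_offload, backward_prefetch, gradient_checkpointing, bf16), the defaults, and the helper settings_from_dict are assumptions for illustration, not the project's actual schema. The allowed string values mirror member names of PyTorch's ShardingStrategy and BackwardPrefetch enums.

```python
from dataclasses import dataclass, fields

# Member names of torch.distributed.fsdp.ShardingStrategy / BackwardPrefetch.
SHARDING_STRATEGIES = {"FULL_SHARD", "SHARD_GRAD_OP", "NO_SHARD", "HYBRID_SHARD"}
BACKWARD_PREFETCH_MODES = {"BACKWARD_PRE", "BACKWARD_POST"}


@dataclass(frozen=True)
class FSDPSettings:
    """Hypothetical view of the FSDP-related part of a training config."""

    sharding_strategy: str = "FULL_SHARD"   # shard params, grads, optimizer state
    cpu_offload: bool = True                # park sharded params on the CPU
    backward_prefetch: str = "BACKWARD_PRE" # fetch next shard before grad compute
    gradient_checkpointing: bool = True     # trade recomputation for memory
    bf16: bool = True                       # BF16 mixed precision

    def __post_init__(self):
        # Fail early on typos instead of deep inside the training loop.
        if self.sharding_strategy not in SHARDING_STRATEGIES:
            raise ValueError(f"unknown sharding_strategy: {self.sharding_strategy}")
        if self.backward_prefetch not in BACKWARD_PREFETCH_MODES:
            raise ValueError(f"unknown backward_prefetch: {self.backward_prefetch}")


def settings_from_dict(raw):
    """Build settings from a parsed YAML mapping, rejecting unknown keys."""
    known = {f.name for f in fields(FSDPSettings)}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return FSDPSettings(**raw)
```

A training script could then translate the validated strings into the real enums, for example with ShardingStrategy[settings.sharding_strategy], before wrapping the model.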

Section 04

Technical Architecture and Training Workflow

The project structure includes directories like config, data, scripts, outputs, with the main training script being train.py. Training workflow:

Data preparation (convert to JSONL, then split into training and validation sets in a 9:1 ratio);
Download base model (e.g., Qwen2.5-0.5B);
Adjust configuration files;
Start training (single GPU: python train.py; multi-GPU: torchrun --nproc_per_node=4 train.py);
Inference testing (inference.py).
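The data preparation step in the workflow above can be sketched roughly as follows. This is an illustrative example rather than the project's own converter: the "text" field name, the output file names train.jsonl and val.jsonl, and the helper functions are all assumptions. It writes one JSON object per line and holds out 10% of the records for validation, matching the 9:1 split.

```python
import json
import random
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL layout)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def split_train_val(records, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then hold out val_fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def convert_texts(texts, out_dir):
    """Turn raw document strings into train.jsonl and val.jsonl."""
    # Drop empty documents; "text" is an assumed field name.
    records = [{"text": t.strip()} for t in texts if t.strip()]
    train, val = split_train_val(records)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    write_jsonl(train, out / "train.jsonl")
    write_jsonl(val, out / "val.jsonl")
    return len(train), len(val)
```

Fixing the shuffle seed keeps the split reproducible, so the validation set stays the same across reruns and across resumed training.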

Section 05

Hardware Requirements and Performance Validation

Minimum requirements: Python 3.10+, CUDA 12.0+, 8 GB RAM, and 8 GB of GPU memory. Recommended configuration: 16 GB+ RAM and 24 GB+ of GPU memory (for 7B+ models). The framework was validated on Qwen2.5-0.5B (494M parameters), which can run on a single GPU; multi-GPU distributed training is recommended for larger models.
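A back-of-the-envelope calculation helps put these numbers in context. The sketch below uses a common rule of thumb for mixed-precision training with an Adam-style optimizer: about 16 bytes of model state per parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for the FP32 master weights and two optimizer moments). That rule, the function name, and the assumption that full sharding divides the state evenly across GPUs are illustrative only; activations, CPU offloading, and framework overhead are ignored, so real usage will differ.

```python
# Assumed rule of thumb: 16-bit weights + 16-bit grads + FP32 master copy and two Adam moments.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gib(num_params, num_gpus=1):
    """Approximate per-GPU model-state memory in GiB, ignoring activations."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Full sharding splits weights, gradients, and optimizer state across GPUs.
    return num_params * BYTES_PER_PARAM / num_gpus / 2**30


for gpus in (1, 4):
    print(f"494M parameters on {gpus} GPU(s): {model_state_gib(494e6, gpus):.2f} GiB")
```

Under these assumptions a 494M-parameter model needs roughly 7.4 GiB of model state on one GPU, and about a quarter of that per GPU when sharded across four.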

Section 06

Practical Operations for Distributed Training

Single-machine multi-GPU: torchrun --nproc_per_node=4 train.py;
Multi-machine multi-GPU: torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.1.100 --master_port=29500 train.py (run the command on every node, setting --node_rank to that node's index: 0 on the master node, 1 on the second node);
HuggingFace Accelerate: run accelerate config first, then accelerate launch train.py.
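Because the multi-machine command has to be issued once per node with a different node rank, it can be convenient to generate it. The following sketch builds the torchrun argument list using only the flags quoted above; the helper torchrun_command is hypothetical and not part of the project.

```python
import shlex


def torchrun_command(script, nproc_per_node, nnodes=1, node_rank=0,
                     master_addr=None, master_port=29500):
    """Return the torchrun argument list for one node of the job."""
    cmd = ["torchrun", f"--nproc_per_node={nproc_per_node}"]
    if nnodes > 1:
        if master_addr is None:
            raise ValueError("multi-node runs need master_addr")
        if not 0 <= node_rank < nnodes:
            raise ValueError("node_rank must be in [0, nnodes)")
        # Every node points at the same master; only node_rank differs.
        cmd += [
            f"--nnodes={nnodes}",
            f"--node_rank={node_rank}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
        ]
    cmd.append(script)
    return cmd


# Print the command to run on each of two nodes.
for rank in range(2):
    args = torchrun_command("train.py", 4, nnodes=2, node_rank=rank,
                            master_addr="192.168.1.100")
    print(shlex.join(args))
```

Returning a list rather than a string means the result can be passed directly to subprocess.run without shell quoting concerns.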

Section 07

Checkpoint Resumption and Model Deployment

Checkpoint resumption: training automatically detects the latest checkpoint, or you can set resume_from_checkpoint manually. Checkpoints are saved every 1000 steps, and only the latest 3 are kept. Deployment: use inference.py for local inference; to upload to the HuggingFace Hub, set push_to_hub=true and hub_model_id.
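The "detect the latest checkpoint, keep the latest 3" behaviour can be illustrated with a short sketch. It assumes checkpoints are stored as directories named checkpoint-<step> inside the output directory, a naming scheme that is common but not confirmed for this project; the function names are likewise hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: one directory per checkpoint, e.g. checkpoint-1000.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(output_dir):
    """Return (step, path) pairs sorted by training step, oldest first."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    found = []
    for path in root.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and path.is_dir():
            found.append((int(match.group(1)), path))
    # Sort numerically so checkpoint-10000 ranks after checkpoint-9000.
    return sorted(found, key=lambda item: item[0])


def latest_checkpoint(output_dir):
    """Return the newest checkpoint directory, or None if there is none."""
    checkpoints = list_checkpoints(output_dir)
    return checkpoints[-1][1] if checkpoints else None


def prune_checkpoints(output_dir, keep_last=3):
    """Delete all but the newest keep_last checkpoints; return removed names."""
    checkpoints = list_checkpoints(output_dir)
    stale = checkpoints[:-keep_last] if keep_last > 0 else checkpoints
    for _, path in stale:
        shutil.rmtree(path)
    return [path.name for _, path in stale]
```

Parsing the step number instead of sorting directory names as strings matters once step counts differ in length, since a plain string sort would rank checkpoint-9000 after checkpoint-10000.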

Section 08

Applicable Scenarios and Project Summary

Applicable scenarios: domain knowledge injection (medical, legal, financial), multilingual expansion, code model training, and enterprise private deployment. Summary: this project provides a complete production-grade framework covering the entire workflow from data conversion to deployment, letting researchers and enterprise developers get started with domain model training quickly.
