Zing Forum

Practical Guide to Continued Pre-training of Large Models: Production-Grade Pipeline Based on PyTorch FSDP

A production-oriented continued pre-training framework for large language models, supporting PyTorch FSDP distributed training, validated on Qwen2.5-0.5B, and providing a complete workflow from data conversion to model deployment.

Tags: large language models, LLM, continued pre-training, PyTorch FSDP, distributed training, domain adaptation, Qwen, model fine-tuning
Published 2026-06-14 05:35 | Recent activity 2026-06-14 05:50 | Estimated read 5 min

Section 01

Introduction to Practical Continued Pre-training of Large Models: Production-Grade Pipeline Based on PyTorch FSDP

This project is a production-oriented continued pre-training framework for large language models. It supports distributed training with PyTorch FSDP, has been validated on Qwen2.5-0.5B, and covers the workflow from data conversion to model deployment. The project is maintained by josephGoke; the source code is available on GitHub (https://github.com/josephGoke/llm-continued-pretraining) and was released on June 13, 2026.

Section 02

Why Do We Need Continued Pre-training?

General-purpose pre-trained models lack domain-specific expertise, pre-training from scratch is costly, and simple fine-tuning struggles to inject large amounts of new knowledge. Continued pre-training is a middle path: an existing model is trained further on a domain-specific corpus, so it can absorb domain knowledge while largely retaining its general capabilities. This has made it a common approach for building domain-specific large models.

Section 03

Analysis of Core Features of the Project

  1. PyTorch FSDP distributed training: Uses the FULL_SHARD sharding strategy together with options such as CPU offloading and backward prefetching to reduce per-GPU memory usage;
  2. Enterprise-level data pipeline: Supports conversion of multiple formats (txt, CSV, PDF, JSON) to JSONL;
  3. Flexible configuration system: Manages models, training hyperparameters, and optimization techniques (e.g., gradient checkpointing, BF16 mixed precision) via YAML;
  4. Comprehensive monitoring: Log saving, Weights & Biases tracking, regular validation, and resuming training from checkpoints.
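
To make the FSDP options in item 1 concrete, here is a minimal sketch of how they map onto PyTorch's FSDP API. The helper name build_fsdp_kwargs is hypothetical, and the specific choices (BACKWARD_PRE prefetching, BF16 for parameters, gradient reduction, and buffers) are assumptions for illustration; the project's own train.py may configure these differently.

```python
import torch
from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    MixedPrecision,
    ShardingStrategy,
)


def build_fsdp_kwargs(cpu_offload: bool = True) -> dict:
    """Collect keyword arguments for FullyShardedDataParallel.

    Hypothetical helper: the values below are illustrative assumptions,
    not settings read from the project's configuration.
    """
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,   # compute in BF16
        reduce_dtype=torch.bfloat16,  # reduce gradients in BF16
        buffer_dtype=torch.bfloat16,
    )
    return {
        # Shard parameters, gradients, and optimizer state across ranks.
        "sharding_strategy": ShardingStrategy.FULL_SHARD,
        # Keep sharded parameters in host memory when not in use.
        "cpu_offload": CPUOffload(offload_params=cpu_offload),
        # Start the next all-gather before the current gradient computation.
        "backward_prefetch": BackwardPrefetch.BACKWARD_PRE,
        "mixed_precision": bf16,
    }
```

Inside a process started by torchrun, and after torch.distributed.init_process_group has been called, the dictionary would be passed as FullyShardedDataParallel(model, **build_fsdp_kwargs()).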

Section 04

Technical Architecture and Training Workflow

The project is organized into directories such as config, data, scripts, and outputs, and the main training script is train.py. The training workflow is:

  1. Data preparation (convert to JSONL, split into training/validation sets in 9:1 ratio);
  2. Download base model (e.g., Qwen2.5-0.5B);
  3. Adjust configuration files;
  4. Start training (single GPU: python train.py; multi-GPU: torchrun --nproc_per_node=4 train.py);
  5. Inference testing (inference.py).
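
Step 1 can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the project's converter: the function names are hypothetical, each output record is assumed to be an object with a single "text" field, plain-text files are assumed to separate documents with blank lines, and CSV and JSON inputs are assumed to carry a "text" column or key. PDF extraction is left out because it needs a third-party library.

```python
import csv
import json
import random
from pathlib import Path


def records_from_file(path: Path) -> list:
    """Read a .txt, .csv, or .json file into a list of {"text": ...} records."""
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # Assumption: blank lines separate documents.
        chunks = path.read_text(encoding="utf-8").split("\n\n")
        return [{"text": c.strip()} for c in chunks if c.strip()]
    if suffix == ".csv":
        # Assumption: the CSV has a column named "text".
        with path.open(newline="", encoding="utf-8") as f:
            return [{"text": r["text"]} for r in csv.DictReader(f) if r.get("text")]
    if suffix == ".json":
        # Assumption: a JSON array of objects that each have a "text" key.
        items = json.loads(path.read_text(encoding="utf-8"))
        return [{"text": i["text"]} for i in items if i.get("text")]
    raise ValueError(f"unsupported format: {suffix}")


def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one record whenever there is more than one.
    n_val = max(1, round(len(shuffled) * val_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(records, path: Path) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

A fixed seed keeps the 9:1 split reproducible across runs, which matters when validation loss is compared between experiments.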

Section 05

Hardware Requirements and Performance Validation

Minimum requirements: Python 3.10+, CUDA 12.0+, 8 GB of RAM, and 8 GB of GPU memory. Recommended configuration: 16 GB or more of RAM and 24 GB or more of GPU memory (for models with 7B or more parameters). The pipeline was validated on Qwen2.5-0.5B (494M parameters), which can be trained on a single GPU; multi-GPU distributed training is recommended for larger models.
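
As a rough cross-check of these figures, the sketch below applies a common rule of thumb for mixed-precision Adam training: about 16 bytes per parameter (2 for 16-bit weights, 2 for 16-bit gradients, and 12 for FP32 master weights plus the two Adam moments). Both the rule of thumb and the function name are assumptions introduced here; activations, buffers, and fragmentation are not counted, and CPU offloading moves part of this state to host memory, so actual usage will differ.

```python
def training_state_gib(n_params: int, n_gpus: int = 1, bytes_per_param: int = 16) -> float:
    """Estimate per-GPU memory (GiB) for weights, gradients, and Adam state.

    Assumes the state is sharded evenly across GPUs, which is what
    FULL_SHARD aims for. Activations are NOT included.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be >= 1")
    return n_params * bytes_per_param / n_gpus / 2**30


if __name__ == "__main__":
    # Qwen2.5-0.5B on a single GPU.
    print(f"{training_state_gib(494_000_000):.2f} GiB")
    # A 7B-parameter model sharded across 4 GPUs.
    print(f"{training_state_gib(7_000_000_000, n_gpus=4):.2f} GiB")
```

Under these assumptions, the 494M-parameter model needs roughly 7.4 GiB of training state on one GPU, while a 7B model sharded over four GPUs needs roughly 26 GiB per GPU before activations. This is why sharding, offloading, and gradient checkpointing matter as model size grows.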

Section 06

Practical Operations for Distributed Training

  1. Single machine, multiple GPUs: torchrun --nproc_per_node=4 train.py;
  2. Multiple machines, multiple GPUs (shown for the first of two nodes; the second node runs the same command with --node_rank=1): torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=192.168.1.100 --master_port=29500 train.py;
  3. HuggingFace Accelerate: Run accelerate config once, then start training with accelerate launch train.py.
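
torchrun tells each worker process where it sits in the job through environment variables, including RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows one way a training script might read them; the DistEnv class and read_dist_env function are hypothetical names, and the defaults for a plain python train.py launch are an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (selects the GPU)
    world_size: int  # total number of processes in the job

    @property
    def is_main(self) -> bool:
        """True for the single process that should log and save checkpoints."""
        return self.rank == 0


def read_dist_env(environ=None) -> DistEnv:
    """Read the variables set by torchrun, defaulting to a single-process run."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

With the two-node command above (4 processes per node), WORLD_SIZE is 8, and the first process on the second node sees LOCAL_RANK=0 and RANK=4.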

Section 07

Checkpoint Resumption and Model Deployment

Checkpoint resumption: Training resumes from the latest checkpoint, which is detected automatically, or from a checkpoint set manually via resume_from_checkpoint. Checkpoints are saved every 1000 steps, and only the latest 3 are kept. Deployment: Use inference.py for local inference; to upload the model to the HuggingFace Hub, set push_to_hub=true and hub_model_id in the configuration.
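
The detect-latest and keep-latest-3 behavior can be illustrated with a short sketch. It assumes checkpoints are stored as directories named checkpoint-<step> inside the output directory; that naming scheme and the function names are assumptions made for the example, not details confirmed by the project.

```python
import re
import shutil
from pathlib import Path
from typing import List, Optional

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")  # assumed naming scheme


def list_checkpoints(output_dir: Path) -> List[Path]:
    """Return checkpoint directories sorted by step number, oldest first."""
    if not output_dir.is_dir():
        return []
    found = []
    for child in output_dir.iterdir():
        match = _CKPT_RE.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    # Sort numerically so checkpoint-10000 ranks after checkpoint-9000.
    found.sort(key=lambda pair: pair[0])
    return [path for _, path in found]


def latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the newest checkpoint, or None when training starts fresh."""
    ckpts = list_checkpoints(output_dir)
    return ckpts[-1] if ckpts else None


def prune_checkpoints(output_dir: Path, keep: int = 3) -> List[Path]:
    """Delete all but the newest `keep` checkpoints; return the deleted paths."""
    ckpts = list_checkpoints(output_dir)
    stale = ckpts[:-keep] if keep > 0 else ckpts
    for path in stale:
        shutil.rmtree(path)
    return stale
```

Sorting by the parsed step number rather than by directory name avoids resuming from the wrong checkpoint once step counts reach a different number of digits.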

Section 08

Applicable Scenarios and Project Summary

Applicable scenarios: Domain knowledge injection (medical, legal, financial), multilingual expansion, code model training, and private enterprise deployment. Summary: The project provides a production-oriented framework that covers the workflow from data conversion to deployment, giving researchers and enterprise developers a quick starting point for domain model training.