
MyLLM: A Complete Open-Source Framework for Building Large Language Models from Scratch

Tags: Large Language Models · Transformer · PyTorch · Open-Source Framework · Machine Learning · Deep Learning · LLM Training · RLHF · LoRA · GitHub
Published 2026-05-03 12:40 · Recent activity 2026-05-03 12:47 · Estimated read: 5 min
Section 01

Introduction

MyLLM is an open-source project for building large language models from scratch. It provides a complete pipeline, from tokenizer training through to RLHF, and helps developers understand every detail of the Transformer architecture.

Section 02

Project Background and Motivation

In today's large language model ecosystem, frameworks like Hugging Face, PyTorch Lightning, and TRL are quite mature, but for the sake of ease of use they encapsulate many low-level details. For researchers and developers who want to understand how Transformers actually work, these "black box" abstractions become a barrier to learning.

The MyLLM project grew out of this need. Its core philosophy is "From Zero to Hero": by implementing each component by hand, users come to understand the complete technology stack behind modern large language models. The project is not just a framework but a systematic learning path.

Section 03

Architecture Design: Transparent Technology Stack

MyLLM adopts a layered architecture, breaking the complex process of training a large model into clear, readable modules:

Section 04

Core Module Composition

  • model.py: Defines the core model structure in the GPT/LLaMA style
  • api.py: Provides the LLM class, with support for model loading, text generation, and batch generation
  • Configs/: Uses dataclasses to define ModelConfig and GenerationConfig (see the sketch after this list)
  • Tokenizers/: Supports GPT2, LLaMA2, LLaMA3, and trainable tokenizers
  • Train/: Contains the SFT, DPO, and PPO training engines
  • utils/: Loaders, samplers, weight mappers, and model registries
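
To give a flavor of the dataclass-based configuration style in Configs/, here is a minimal sketch; the field names and default values are illustrative assumptions, not MyLLM's actual definitions:

    from dataclasses import dataclass

    # A minimal sketch of dataclass-driven configs in the spirit of Configs/.
    # Field names and defaults are assumptions for illustration only.
    @dataclass
    class ModelConfig:
        vocab_size: int = 32000    # tokenizer vocabulary size
        n_layers: int = 12         # number of Transformer blocks
        n_heads: int = 12          # attention heads per block
        d_model: int = 768         # hidden (embedding) dimension
        max_seq_len: int = 1024    # context window

    @dataclass
    class GenerationConfig:
        max_new_tokens: int = 128  # tokens to generate per call
        temperature: float = 0.8   # logit scaling before sampling
        top_k: int = 50            # keep only the k most likely tokens

Keeping configuration in plain dataclasses like this makes every hyperparameter visible and type-checked, which fits the project's transparency goal.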

Section 05

Training Engine Architecture

The training module uses a plug-in design and supports multiple training paradigms:

  • SFTTrainer: Supervised fine-tuning trainer (fully implemented; see the sketch after this list)
  • DPOTrainer: Direct Preference Optimization (reserved as a placeholder in the framework)
  • PPOTrainer: Proximal Policy Optimization / RLHF (reserved as a placeholder in the framework)
  • Accelerator: Supports multiple acceleration schemes, including single GPU, DDP, DeepSpeed, and FSDP
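
To make concrete what an SFTTrainer-style step computes, here is a generic sketch of one supervised fine-tuning update. The batch layout (input_ids plus labels, with -100 marking positions excluded from the loss) follows a common convention and is an assumption, not MyLLM's actual API:

    import torch.nn.functional as F

    def sft_step(model, batch, optimizer):
        # One supervised fine-tuning step: next-token cross-entropy.
        # Assumes model(input_ids) returns logits of shape (B, T, vocab)
        # and that labels use -100 for prompt/padding positions.
        logits = model(batch["input_ids"])
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from prefix
            batch["labels"][:, 1:].reshape(-1),           # targets shifted by one
            ignore_index=-100,
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()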

Section 06

Learning Path: From Theory to Practice

MyLLM provides three progressive learning paths to meet the needs of users at different stages:

Section 07

1. Guided Notebooks (notebooks/)

Contains 21 carefully designed Jupyter notebooks covering every step from word embeddings to attention mechanisms to complete model training. Each notebook pairs detailed theoretical explanations with runnable code examples.
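
The attention mechanism those notebooks build up can be written in a few lines; the following is a generic textbook sketch rather than code taken from the repository:

    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # softmax(q @ k^T / sqrt(d_k)) @ v, the core of every Transformer layer
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))  # causal/padding mask
        return torch.softmax(scores, dim=-1) @ v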

Section 08

2. Independent Experiment Modules (Modules/)

Breaks complex concepts into independent experimental units, with each module focusing on one core concept, such as positional encoding, multi-head attention, or layer normalization. This "master one concept at a time" design flattens the learning curve.
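
Layer normalization is a good example of a concept small enough for one module. A from-scratch version, shown here as a generic sketch rather than MyLLM's implementation, fits in a dozen lines:

    import torch
    import torch.nn as nn

    class LayerNorm(nn.Module):
        # Normalize each token's features to zero mean and unit variance,
        # then apply a learnable scale (gamma) and shift (beta).
        def __init__(self, dim, eps=1e-5):
            super().__init__()
            self.gamma = nn.Parameter(torch.ones(dim))
            self.beta = nn.Parameter(torch.zeros(dim))
            self.eps = eps

        def forward(self, x):
            mean = x.mean(dim=-1, keepdim=True)
            var = x.var(dim=-1, keepdim=True, unbiased=False)
            return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta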