Zing Forum

Building Large Language Models from Scratch: Complete Code Implementation of an Open-Source Book

An in-depth look at the milistu/LLMs-from-scratch project, which implements the code for the book *Build a Large Language Model (From Scratch)* chapter by chapter. The material published so far covers core concepts such as data loading, word embedding, and positional encoding, helping developers understand the internal mechanisms of GPT-like models from the ground up.

Large Language Model, LLM, GPT, PyTorch, Deep Learning, Transformer, Word Embedding, Positional Encoding, Machine Learning, Open Source, Tutorial
Published 2026-06-10 19:44 | Recent activity 2026-06-10 19:49 | Estimated read 5 min

Section 01

[Introduction] Building Large Language Models from Scratch: Analysis of the Open-Source Book's Supporting Code

The LLMs-from-scratch project maintained by milistu is a code repository that follows the book Build a Large Language Model (From Scratch). It is hosted on GitHub (https://github.com/milistu/LLMs-from-scratch) under the Apache License 2.0. The repository works through the book's core concepts in code, starting with data loading, word embedding, and positional encoding, so that developers can see how a GPT-like model is assembled piece by piece.

Section 02

Project Background and Motivation

With the explosive popularity of large language models like ChatGPT and Claude, developers want to deeply understand how these models work. However, most tutorials either stay at the API calling level or jump directly into complex paper implementations, lacking a complete, step-by-step learning path. This project aims to fill this gap by helping readers understand the working principles of each component of GPT-like models through a step-by-step building approach.

Section 03

Tech Stack and Project Overview

The project uses Python 3.13+ and a modern tech stack: PyTorch (>=2.12.0) for model building and training, tiktoken (>=0.13.0) as the tokenizer, NumPy (>=2.4.6) for numerical computations, and Jupyter (>=1.1.1) for an interactive environment. The code is organized by book chapters, and the complete implementation of Chapter 2 is currently available.
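
A quick way to compare a local environment against the minimum versions quoted above is sketched below. This is not part of the repository; the helper names (`version_tuple`, `meets_minimum`, `report`) are hypothetical, and the distribution names (`torch`, `tiktoken`, `numpy`, `jupyter`) are assumed to be the usual PyPI names for these packages.

```python
import re
import sys
from importlib import metadata

# Minimum versions as quoted in the text; keys are assumed PyPI distribution names.
MINIMUMS = {"torch": "2.12.0", "tiktoken": "0.13.0", "numpy": "2.4.6", "jupyter": "1.1.1"}


def version_tuple(version):
    """Turn '2.4.6' or '2.12.0+cu126' into a tuple of leading integers."""
    parts = []
    # Drop any local build tag after '+', then read the numeric start of each piece.
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "2.12" and "2.12.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report():
    """Build one status line for Python itself and one per package in MINIMUMS."""
    lines = [f"python {sys.version_info.major}.{sys.version_info.minor} (text asks for 3.13+)"]
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
            status = "ok" if meets_minimum(installed, minimum) else "too old"
        except metadata.PackageNotFoundError:
            installed, status = "-", "missing"
        lines.append(f"{name}: installed {installed}, needs >={minimum} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The comparison only looks at the leading numeric parts of each version, which is enough for a rough check but is not a full implementation of Python's version-ordering rules.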

Section 04

Analysis of Core Concepts

1. Data Loading: The target sequence is the input sequence shifted by one token, so each position is trained to predict the token that follows it. Combined with a causal mask, this ensures the model predicts only from tokens it has already seen.
2. Sliding Window and Stride: Training samples are cut from the text with a fixed-length sliding window; adjusting the stride trades training efficiency against data coverage.
3. Word Embedding: A learnable lookup table whose parameters are optimized end-to-end with the rest of the model.
4. Positional Encoding: Covers the learnable absolute positional encoding used by GPT, and introduces relative positional encoding at a conceptual level, noting its ability to generalize across sequence lengths.
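
The four ideas above can be illustrated together in a few lines of PyTorch. The sketch below is not taken from the repository: the function name `sliding_windows`, the toy token IDs, and the tiny vocabulary and embedding sizes are all assumptions chosen to keep the example readable. It cuts a token sequence into overlapping windows, builds targets shifted by one token, and then adds learnable token and absolute position embeddings.

```python
import torch


def sliding_windows(token_ids, max_length, stride):
    """Cut a token-ID list into (input, target) windows.

    Each target window is its input window shifted by one token.
    """
    inputs, targets = [], []
    # Stop early enough that every target window still has max_length tokens.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[start : start + max_length])
        targets.append(token_ids[start + 1 : start + max_length + 1])
    return torch.tensor(inputs), torch.tensor(targets)


# Toy token IDs standing in for tokenizer output (illustrative values only).
token_ids = list(range(10))
max_length, stride = 4, 2  # stride < max_length, so consecutive windows overlap
x, y = sliding_windows(token_ids, max_length, stride)

vocab_size, emb_dim = 10, 8  # tiny sizes chosen for the demo
token_embedding = torch.nn.Embedding(vocab_size, emb_dim)     # learnable lookup table
position_embedding = torch.nn.Embedding(max_length, emb_dim)  # learnable absolute positions

token_vectors = token_embedding(x)                               # shape (3, 4, 8)
position_vectors = position_embedding(torch.arange(max_length))  # shape (4, 8)
model_input = token_vectors + position_vectors                   # broadcasts to (3, 4, 8)

print(x.tolist())
print(y.tolist())
print(tuple(model_input.shape))
```

Setting `stride` equal to `max_length` would produce non-overlapping windows and fewer samples, while a smaller stride produces more, heavily overlapping samples from the same text. Both embedding tables are ordinary `torch.nn.Embedding` layers, so their weights receive gradients like any other model parameter.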

Section 05

Code Structure and Target Audience

The Chapter 2 code includes: ch02.ipynb (concept explanation + code demonstration), dataloader.py (sliding window sampling logic), exercise-solutions.ipynb (exercise answers), notes.md (core knowledge points), and the-verdict.txt (sample data).

Target audience: Developers who want to dive deep into Transformers, ML beginners with Python basics, educators, and researchers. The project allows free use, modification, and distribution.

Section 06

Summary and Outlook

This project gives developers a clear learning path for understanding large language models: beyond learning to use the tools, readers see the reasoning behind each design decision. Advanced topics such as pre-training, fine-tuning, and inference optimization are planned for future updates. It is an open-source project worth following and contributing to.