Zing Forum

Building Large Models from Scratch: 23 Notebooks for a Full-Stack Understanding of Modern LLMs

A hands-on tutorial that implements the core components of large models from scratch in PyTorch instead of relying on pre-built model libraries. It covers the full tech stack: tokenizer, attention, MoE, RLHF, and inference acceleration. It is aimed at learners who want to understand how the pieces work rather than only how to call an API.

Tags: Large Language Models, PyTorch, Jupyter Notebook, Transformer, BPE Tokenizer, Attention Mechanism, MoE, RLHF, Inference Acceleration, Knowledge Distillation
Published 2026-05-21 14:15 | Recent activity 2026-05-21 14:19 | Estimated read 5 min

Section 01

Introduction: 23 Notebooks to Build a Full-Stack Understanding of Modern LLMs from Scratch

The project uses 23 Jupyter Notebooks to help learners build a full-stack understanding of modern LLMs. Each core component, from the tokenizer through attention, MoE, and RLHF to inference acceleration, is implemented by hand rather than imported from a pre-built model library, so the focus stays on how the internals work and not on how to call an API.

Section 02

Background: Why Do We Need to 'Build Large Models from Scratch'?

Current learning resources for large language models tend to fall short in one of two ways. High-level paper reviews explain the principles but cannot be turned directly into code. API-calling tutorials get something running quickly but leave the model a black box. The walkinglabs/modern-llm-notebook project fills this gap by requiring every core component to be implemented from scratch in PyTorch, which forces learners to work directly with tensor operations and gradient flow.

Section 03

Methodology: A Complete Learning Path with Five Modules

The project is divided into five progressive modules:

  1. Basic Construction (Notebooks 01-05): Implement the tokenizer, positional encoding, Multi-Head Attention, and a Mini-GPT skeleton;
  2. Training Techniques (06-14): Architecture optimization (LLaMA improvements, MoE), training workflow, data engineering, LoRA, RLHF;
  3. Inference Acceleration (15-17): Generation strategies, KV Cache, FlashAttention, speculative decoding;
  4. Cutting-Edge Exploration (18-20): Long context extension, Chain of Thought, VLM;
  5. Production Practice (21-23): Evaluation system, knowledge distillation, policy distillation.

Each Notebook follows the same cycle: 'Intuitive understanding → Manual calculation verification → Code implementation → Experimental observation'.
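
To make the 'manual calculation verification → code implementation' step concrete, here is a minimal sketch of single-head causal attention, one of the Module 1 topics. It is written in dependency-free Python rather than PyTorch so that every number can be checked by hand. The function names `softmax` and `causal_attention` and the toy inputs are illustrative assumptions, not code taken from the notebooks.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats); q and k share dimension d.
    Position i may only attend to positions 0..i.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out, weights = [], []
    for i, q_i in enumerate(q):
        # Score only the visible keys. Skipping j > i has the same effect
        # as setting those scores to -inf before the softmax.
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        w = softmax(scores)
        # Weighted sum of the visible value vectors.
        out_i = [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        # Pad with zeros so each weight row has length T.
        weights.append(w + [0.0] * (len(q) - i - 1))
        out.append(out_i)
    return out, weights


# Hand-checkable case: identical keys give uniform weights over visible positions.
q = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
k = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(q, k, v)
for row in weights:
    print([round(x, 3) for x in row])
```

Because all keys are identical, the first position attends only to itself, the second splits its weight evenly over two positions, and the third averages all three value vectors. This can be confirmed on paper before running the code.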

Section 04

Evidence: Direct Correspondence with Classic Papers

The project's core algorithms are closely linked to original papers:

Paper                       Notebook   Implemented content
Attention Is All You Need   04         Multi-Head Attention, sinusoidal PE
LLaMA                       06         RMSNorm, SwiGLU, RoPE
LoRA                        12         Low-rank adaptation, A*B decomposition
RLHF/PPO                    14         Reward model, PPO clip

This design lets learners see runnable code right after reading the paper, which deepens their understanding.
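
As a rough illustration of how compact some of these paper formulas become in code, the sketch below writes out three of them in plain Python: RMSNorm, the parameter count of a LoRA low-rank update, and the PPO clipped surrogate for a single sample. The function names, default values (`eps=1e-6`, `clip_eps=0.2`), and example sizes are assumptions chosen for illustration. The notebooks themselves use PyTorch and may structure this differently.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a per-element gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of a rank-r update: one (d_out x r) and one (r x d_in) matrix."""
    return rank * (d_out + d_in)


def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


# RMSNorm leaves the vector with a mean square of roughly 1 when the gain is all ones.
normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(sum(v * v for v in normed) / len(normed))

# A rank-8 update of a 4096 x 4096 weight trains far fewer parameters than the full matrix.
print(lora_param_count(4096, 4096, rank=8), "vs", 4096 * 4096)

# With a positive advantage, the gain from raising the probability ratio is capped at 1 + eps.
print(ppo_clip_objective(ratio=1.5, advantage=1.0))
# With a negative advantage, the unclipped (more pessimistic) term is kept.
print(ppo_clip_objective(ratio=1.5, advantage=-1.0))
```

The clipping asymmetry in the last two calls is the part most worth verifying by hand: the objective caps the reward for moving the policy too far in a good direction, but it never hides the penalty for moving too far in a bad one.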

Section 05

Suggestions: Technical Threshold and Learning Guide

The project requires Python 3.9+, PyTorch 2.0+, and 16 GB of memory. Most Notebooks run on a CPU; a GPU is recommended for the training Notebooks. The Notebooks are modular, so you can jump to the sections you need:

  • Those with Transformer basics can skip to MoE or inference acceleration;
  • Those focusing on deployment can look at production practice;
  • Those wanting to complete their knowledge graph can follow the sequence.

A React+Vite web reader is also provided to enhance the experience.
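
The version requirements stated in this section can be checked before opening the first Notebook. The following is a small sketch using only the standard library. The helper names `version_tuple` and `check_environment` are hypothetical and not part of the project, and the 16 GB memory requirement is not checked because the standard library has no portable way to query it.

```python
import importlib.util
import re
import sys


def version_tuple(version):
    """Parse the leading 'major.minor' of a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(min_python=(3, 9), min_torch=(2, 0)):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch  # imported lazily so the check itself runs without PyTorch

        if version_tuple(torch.__version__) < min_torch:
            problems.append(f"PyTorch {min_torch[0]}.{min_torch[1]}+ required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Returning a list of problems instead of raising on the first one lets a setup cell report everything that needs fixing in a single run.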

Section 06

Conclusion: Practical Value and Unique Positioning

Compared to tutorials like nanoGPT, this project stands out for its completeness, covering the full stack from the tokenizer to policy distillation, and for its coverage of recent (2024-2025) topics such as speculative decoding and VLMs. It suits researchers, engineers, and students who want to understand the internal mechanisms of large models. Implementing each component by hand builds a depth of understanding that calling APIs alone cannot provide.