Zing Forum


Building a Large Language Model from Scratch: A Complete Handwritten LLM Training Workflow

This article introduces a complete project for building a large language model from scratch on a Mac. It covers 10 stages, from data preparation to Ollama deployment, and is implemented in pure PyTorch without relying on frameworks such as HuggingFace.

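Tags: Large Language Model, LLM, PyTorch, Building from Scratch, BPE Tokenizer, Transformer, Supervised Fine-Tuning, Ollama Deployment, Machine Learning, Deep Learning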
Published 2026-06-09 02:13 | Recent activity 2026-06-09 02:21 | Estimated read 7 min

Section 01

Guide to a Full-Workflow Project for Building an LLM from Scratch: Pure PyTorch Implementation, Runnable on Mac

This article introduces the open-source project "story-llm-finetuned-mac", created by developer sppandita85. The project builds a large language model from scratch on a Mac and covers 10 stages, from data preparation to Ollama deployment. It is implemented in pure PyTorch, does not rely on frameworks such as HuggingFace, and can run on the CPU. The model is trained on only 50 moral stories (about 6,000 tokens), so it memorizes and overfits its training data. Its workflow, however, mirrors that of industrial-grade LLMs, which makes it a good fit for learners who want to understand the internal mechanisms.


Section 02

Project Background and Design Philosophy

High-level frameworks often wrap LLM training into a "black box", which makes it hard for developers to learn what actually happens inside. This project takes "architecture fidelity" as its core design philosophy: although the data scale is small, it reproduces every step of an industrial-grade LLM training workflow, so learners can walk through the whole LLM life cycle on a personal Mac. Because the implementation is pure PyTorch with no dependence on existing training frameworks, and because it runs on the CPU, the barrier to entry is low.


Section 03

Data Processing and Model Construction (Stages 1-4)

The project divides the training workflow into 10 stages:

  • Stage 1 (Data Preparation): Clean the raw markdown, insert special tokens, and split the text into training and validation sets;
  • Stage 2 (Tokenizer Training): Train a custom BPE tokenizer from scratch, so that out-of-vocabulary words are split into known subword units;
  • Stage 3 (Data Encoding): Encode the text into token IDs, store them as binary files, and implement a sliding-window DataLoader;
  • Stage 4 (Model Construction): Implement a GPT-style Transformer from scratch in PyTorch, including components such as multi-head attention and the feed-forward network, and verify that the model is correct.
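
The tokenizer and encoding stages can be illustrated with a minimal, framework-free sketch. It is not the project's code: the function names (`train_bpe`, `apply_merge`, `build_vocab`, `encode`, `sliding_windows`) and the `<unk>` fallback are assumptions made for illustration, and the merges are learned over the raw character stream rather than over pre-split words, which a production BPE tokenizer would usually do.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def build_vocab(text, merges):
    """Map every base character and merged token to an integer ID (0 is <unk>)."""
    vocab = {"<unk>": 0}
    for symbol in sorted(set(text)) + [a + b for a, b in merges]:
        vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(text, merges, vocab):
    """Apply the learned merges in order, then look up token IDs."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


def sliding_windows(ids, block_size, stride=1):
    """Yield (input, target) pairs; the target is the input shifted by one token."""
    for start in range(0, len(ids) - block_size, stride):
        chunk = ids[start:start + block_size + 1]
        yield chunk[:-1], chunk[1:]


if __name__ == "__main__":
    corpus = "the cat sat on the mat. the cat ate."
    merges = train_bpe(corpus, num_merges=10)
    vocab = build_vocab(corpus, merges)
    ids = encode(corpus, merges, vocab)
    for inputs, targets in sliding_windows(ids, block_size=4, stride=4):
        print(inputs, targets)
```

Each window pairs a block of token IDs with the same block shifted one position to the right, which is the next-token prediction target used in pre-training. In a PyTorch version, a `Dataset` would typically return these pairs as tensors.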

Section 04

Pre-training and Supervised Fine-tuning (Stages 5-8)

  • Stage 5 (Pre-training): Train with the AdamW optimizer and a learning-rate schedule that combines warmup with cosine annealing, and implement gradient clipping and checkpoint saving;
  • Stage 6 (Text Generation): Sample text from the pre-trained model to evaluate the effect of pre-training;
  • Stage 7 (Q&A Dataset Construction): Derive instruction Q&A pairs from the pre-training corpus and convert them to a dialogue training format;
  • Stage 8 (Supervised Fine-tuning): Train on the Q&A dataset with a masked loss (the loss is computed only on the answer tokens), so that the model learns to follow instructions.
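
Two of the ideas above, the warmup plus cosine schedule from Stage 5 and the masked loss from Stage 8, can be written down in a few lines of plain Python. This is a sketch under stated assumptions rather than the project's implementation: the function names, the `min_lr` floor, and the example values are hypothetical, and the math is spelled out by hand so that it runs without PyTorch.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


def masked_cross_entropy(logits, targets, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    logits:    one list of vocabulary scores per position
    targets:   the correct token ID per position
    loss_mask: 1 for answer tokens, 0 for prompt tokens
    """
    total, counted = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prompt tokens contribute nothing to the loss
        peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    for step in (0, 9, 55, 99):
        print(step, lr_at_step(step, 1e-3, 1e-4, warmup_steps=10, total_steps=100))
    # Two positions, two-token vocabulary; only the second position is an answer token.
    print(masked_cross_entropy([[2.0, -1.0], [0.0, 0.0]], [1, 1], [0, 1]))
```

In PyTorch, a common way to get the same masking effect is to set the prompt positions of the target tensor to `-100`, the default `ignore_index` of `torch.nn.functional.cross_entropy`. The article does not say which approach the project uses.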

Section 05

Interaction and Deployment (Stages 9-10)

  • Stage 9 (Dialogue Interaction): Provide a command-line interface so that users can chat interactively with the fine-tuned model;
  • Stage 10 (Ollama Deployment): Convert the model to GGUF format, where quantization reduces memory usage, and deploy it to the Ollama platform for easy access.
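
Once a GGUF file exists, Ollama needs a Modelfile that points at it. The sketch below only generates that file. The GGUF conversion itself is not shown, and the helper names, the file path `./story-llm.gguf`, and the temperature value are placeholders rather than details from the project.

```python
from pathlib import Path


def build_modelfile(gguf_path, temperature=0.7):
    """Return the text of a minimal Ollama Modelfile for a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",                      # which weights to load
        f"PARAMETER temperature {temperature}",   # sampling temperature
    ]
    return "\n".join(lines) + "\n"


def write_modelfile(directory, gguf_path, temperature=0.7):
    """Write the Modelfile into `directory` and return its path."""
    path = Path(directory) / "Modelfile"
    path.write_text(build_modelfile(gguf_path, temperature), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical path; replace it with the GGUF file produced by the conversion step.
    print(build_modelfile("./story-llm.gguf"))
```

With the Modelfile in place, `ollama create <name> -f Modelfile` registers the model locally and `ollama run <name>` starts a chat session, where `<name>` is whatever model name you choose.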

Section 06

Technical Highlights and Scalability

The project code is well organized: shared code (configuration, tokenizer, model, and so on) lives in the common directory, and the modular design is easy to extend. To scale up to realistic training, you only need to change the hyperparameters in common/config.py: increase the vocabulary size, the number of layers, the number of attention heads, the embedding dimension, and the number of training epochs, point to a larger corpus, and switch the device to a GPU.
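
To make the scaling step concrete, here is a sketch of what such a configuration could look like. The actual contents of common/config.py are not shown in this article, so every field name and value below is hypothetical. The parameter estimate uses the common rule of thumb of about 12 × n_embd² weights per Transformer block, ignoring biases and layer norms.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    # All field names and defaults are illustrative, not taken from the project.
    vocab_size: int = 512
    n_layer: int = 2
    n_head: int = 2
    n_embd: int = 64
    block_size: int = 64
    epochs: int = 50
    device: str = "cpu"
    corpus_path: str = "data/stories.txt"

    def __post_init__(self):
        # Each attention head gets an equal slice of the embedding dimension.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    def approx_params(self):
        """Rough count: token and position embeddings plus 12 * n_embd^2 per layer."""
        embeddings = (self.vocab_size + self.block_size) * self.n_embd
        blocks = 12 * self.n_layer * self.n_embd ** 2
        return embeddings + blocks


# A toy configuration and a scaled-up copy that changes only hyperparameters.
tiny = ModelConfig()
scaled = replace(
    tiny,
    vocab_size=32000,
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024,
    epochs=3,
    device="cuda",  # or "mps" on Apple Silicon
    corpus_path="data/large_corpus.txt",
)

if __name__ == "__main__":
    print(tiny.approx_params(), scaled.approx_params())
```

Under these assumed values, the toy configuration comes to roughly 135 thousand parameters and the scaled one to roughly 110 million, which shows how far a handful of hyperparameters can move the model size while the code stays the same.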


Section 07

Learning Value and Practical Significance

This project gives LLM learners a practical entry path. By running each stage in turn, you build an end-to-end understanding of data processing, the tokenizer, the Transformer, the optimization strategy, and deployment. The author provides a Model Card that records the model information and has published the model on the Ollama platform (ollama.com/sppandita85/story-llm), so you can try it directly.


Section 08

Project Summary

The "story-llm-finetuned-mac" project is small in scale but complete in workflow. Its pure PyTorch implementation lets learners see how each step of the pipeline works. For developers who want to understand LLM technology at the level of first principles, it is an open-source project worth studying in depth.