Zing Forum

Building Reasoning Models from Scratch: O'Reilly Course Helps You Deeply Understand the Reasoning Mechanisms of o1, DeepSeek R1, and Gemini 2.0

This is a complete set of O'Reilly hands-on course materials. By building a DeepSeek R1-style reasoning model training process from scratch, it helps learners deeply understand the working principles of modern reasoning models, including core technologies like Chain of Thought (CoT) and GRPO reinforcement learning.

Tags: Reasoning models, DeepSeek R1, Chain of Thought, GRPO, Reinforcement learning, O'Reilly course, AI training, Large language models
Published 2026-04-07 20:06 | Recent activity 2026-04-07 20:19 | Estimated read 6 min

Section 01

Introduction: O'Reilly Course Guides You to Build Reasoning Models from Scratch and Deeply Understand Core Mechanisms

This hands-on O'Reilly course helps learners understand how modern reasoning models (such as o1, DeepSeek R1, and Gemini 2.0) work by building a DeepSeek R1-style training pipeline from scratch. It covers core techniques such as Chain of Thought (CoT) and GRPO reinforcement learning. The course is practice-oriented and takes learners through the full process of building a reasoning model, from theory to code.

Section 02

Background: The Rise of Reasoning Models and the Concept of Chain of Thought

With the rise of reasoning models such as OpenAI's o-series and DeepSeek R1, the AI field is shifting from 'quick answers' to 'deep thinking'. Reasoning models differ from traditional large language models in that they generate intermediate thinking steps (Chain of Thought) before answering, an ability acquired through specific post-training techniques. The course first builds an intuitive understanding of Chain of Thought and traces how it evolved from a prompting technique into an ability the model learns during training.
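
To make the "prompting technique" end of that evolution concrete, here is a minimal sketch of prompt-based Chain of Thought. The helper name, the prompt layout, and the example question are assumptions made for illustration and are not taken from the course materials; the suffix is the widely used zero-shot CoT phrase.

```python
# Illustrative sketch only: build_prompt and the prompt layout are
# hypothetical, not part of the course repository.

COT_SUFFIX = "Let's think step by step."


def build_prompt(question: str, use_cot: bool = False) -> str:
    """Return a direct prompt, or one that asks for intermediate steps."""
    prompt = f"Question: {question}\nAnswer:"
    if use_cot:
        # Prompt-based CoT: the extra instruction elicits reasoning at
        # inference time, without any change to the model's weights.
        prompt += f" {COT_SUFFIX}"
    return prompt


question = "A train travels 60 km in 1.5 hours. What is its average speed?"
direct_prompt = build_prompt(question)
cot_prompt = build_prompt(question, use_cot=True)

print(direct_prompt)
print(cot_prompt)
```

A reasoning model aims to produce such intermediate steps without the extra instruction, which is what the post-training stages are for.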

Section 03

Core Method: The Five-Stage Training Process of DeepSeek R1

The core of the course is the five-stage training process proposed in the DeepSeek R1 paper:

  1. Pre-training: The foundational stage, trained with autoregressive language modeling objectives, which determines the upper limit of the model's language understanding;
  2. Cold-start Supervised Fine-tuning (SFT): A key innovation, using a small number of high-quality reasoning examples for fine-tuning to enable the model to learn structured expression of thinking;
  3. GRPO Reinforcement Learning: The technical core. GRPO needs no value network; it estimates each sample's advantage from its reward relative to the other samples in the same group, which reduces training cost. The course provides a complete PyTorch implementation;
  4. Rejection Sampling SFT: Selecting high-quality reasoning trajectories for a second round of fine-tuning to improve quality;
  5. Distillation: Distilling the model into a smaller one for deployment in resource-constrained environments.
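
Stages 3 and 4 can be sketched in a few lines. The code below is a rough, standard-library illustration and not the course's PyTorch implementation. The function names, the 0/1 reward scheme, the acceptance threshold, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample against its own group: (r - mean) / (std + eps).

    The group mean stands in for the baseline that a value network would
    otherwise supply. Population std is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rejection_sample(completions, rewards, threshold=1.0):
    """Keep only completions whose reward meets the (assumed) threshold."""
    return [c for c, r in zip(completions, rewards) if r >= threshold]


# Four hypothetical completions sampled for a single prompt.
# Assumed reward: 1.0 if the final answer is correct, otherwise 0.0.
completions = ["trace A", "trace B", "trace C", "trace D"]
rewards = [1.0, 0.0, 0.0, 1.0]

advantages = group_relative_advantages(rewards)
kept = rejection_sample(completions, rewards)

print(advantages)  # positive for correct traces, negative for the rest
print(kept)        # traces that would feed the second SFT round
```

If every sample in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the reward needs to separate good traces from bad ones.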

Section 04

Hands-on Practice: From Notebooks to Demo Applications

The course provides a complete series of Jupyter Notebooks, one for each training stage, with code comments and visualizations. Learners can work through them in order or jump to any stage. The accompanying demo applications include:

  • Math problem solver: Comparing the differences between direct answers and Chain of Thought reasoning;
  • Logic puzzle solver: Demonstrating multi-step reasoning and hypothesis testing;
  • Planning agent: Showing subtask decomposition and execution plan generation in task planning.

The course also provides a model selection decision tree and comparison tools.

Section 05

Flexible Usage: Three Learning Methods for the Course

The course supports three flexible usage methods:

  1. GitHub Codespaces (Recommended): A fully configured environment that runs in the browser, with support for setting API keys;
  2. Local Run: Use the uv package manager to set up the environment (Python 3.11+);
  3. Existing Environment: Directly clone the repository and run the notebooks (requires familiarity with Jupyter and PyTorch).

Section 06

Course Value: Why Is It Worth Paying Attention To?

As reasoning models grow in importance, calling APIs alone is no longer enough. What sets this course apart is that learners build a reasoning model themselves, working through technical details such as GRPO and rejection sampling and coming away with a solid understanding of how these models work. It is suitable for developers, researchers, and technical decision-makers who want to understand the principles behind models like o1 and DeepSeek R1.