
Building Reasoning Models from Scratch: O'Reilly Course Helps You Deeply Understand the Reasoning Mechanisms of o1, DeepSeek R1, and Gemini 2.0

This is a complete set of O'Reilly hands-on course materials. By building a DeepSeek R1-style reasoning model training process from scratch, it helps learners deeply understand the working principles of modern reasoning models, including core technologies like Chain of Thought (CoT) and GRPO reinforcement learning.

Reasoning models, DeepSeek R1, Chain of Thought, GRPO, reinforcement learning, O'Reilly course, AI training, large language models

Published 2026-04-07 20:06 | Recent activity 2026-04-07 20:19 | Estimated read 6 min


Section 01

Introduction: O'Reilly Course Guides You to Build Reasoning Models from Scratch and Deeply Understand Core Mechanisms

This hands-on O'Reilly course explains how modern reasoning models (such as o1, DeepSeek R1, and Gemini 2.0) work by having learners build a DeepSeek R1-style training process from scratch. It covers core techniques such as Chain of Thought (CoT) and GRPO reinforcement learning. The emphasis is practical: learners work through the whole process of building a reasoning model, from theory to code.

Section 02

Background: The Rise of Reasoning Models and the Concept of Chain of Thought

With the rise of reasoning models like OpenAI's o-series and DeepSeek R1, the AI field is shifting from 'quick answers' to 'deep thinking'. Reasoning models differ from traditional large language models in that they generate intermediate thinking steps (a Chain of Thought) before answering, an ability acquired through specific post-training techniques. The course first builds an intuitive understanding of Chain of Thought, then traces how it evolved from a prompting technique into an ability the model has internalized.

Section 03

Core Method: The Five-Stage Training Process of DeepSeek R1

The core of the course is a five-stage training process modeled on the DeepSeek R1 paper:

Pre-training: The foundational stage. The model is trained with an autoregressive language modeling objective, which sets the ceiling on its language understanding;
Cold-start Supervised Fine-tuning (SFT): A key innovation. The model is fine-tuned on a small number of high-quality reasoning examples so that it learns to express its thinking in a structured way;
GRPO Reinforcement Learning: The technical core. GRPO needs no value network: it estimates each sample's advantage from its reward relative to the other samples in the same group, which reduces training cost. The course provides a complete PyTorch implementation;
Rejection Sampling SFT: High-quality reasoning trajectories are selected and used for a second round of fine-tuning to improve output quality;
Distillation: The model is distilled into a smaller one for deployment in resource-constrained environments.
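
The group-relative advantage at the heart of the GRPO stage can be illustrated in a few lines. This is a minimal sketch in plain Python, not the course's PyTorch implementation: the function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are assumptions made for illustration. Each reward is normalized against the mean and spread of the other samples generated for the same prompt, so no learned value network is involved.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for the same prompt.

    Hypothetical helper: advantage_i = (r_i - mean(r)) / (std(r) + eps).
    Population standard deviation is an assumption of this sketch.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers get a positive advantage, incorrect ones a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the sketch assumes several samples per prompt.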

Section 04

Hands-on Practice: From Notebooks to Demo Applications

The course provides a complete series of Jupyter Notebooks, one for each training stage, with code comments and visualizations. Learners can work through them in order or jump straight to any stage. The accompanying demo applications include:

Math problem solver: Comparing the differences between direct answers and Chain of Thought reasoning;
Logic puzzle solver: Demonstrating multi-step reasoning and hypothesis testing;
Planning agent: Showing subtask decomposition and execution plan generation in task planning.

The course also provides a model selection decision tree and comparison tools.
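
The contrast the math problem solver draws between a direct answer and Chain of Thought reasoning can be sketched as two prompt templates plus a small parser. This is an illustrative sketch only: the helper names, the prompt wording, and the `<think>`/`<answer>` tag format are assumptions, not the course's actual code, and the sample completion is hand-written rather than real model output.

```python
import re


def direct_prompt(question):
    """Hypothetical template asking for the answer with no visible reasoning."""
    return f"{question}\nAnswer with the final number only."


def cot_prompt(question):
    """Hypothetical template asking for reasoning first, then a tagged answer."""
    return (
        f"{question}\n"
        "Think step by step inside <think>...</think>, "
        "then give the final number inside <answer>...</answer>."
    )


# Assumed output format: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(completion):
    """Return the text inside the first <answer> tag, or None if it is missing."""
    match = ANSWER_RE.search(completion)
    return match.group(1) if match else None


question = "What is 17 * 3 + 4?"
# Hand-written stand-in for a model completion, used only to exercise the parser.
sample = "<think>17 * 3 = 51, and 51 + 4 = 55.</think><answer>55</answer>"

print(direct_prompt(question))
print(cot_prompt(question))
print(extract_answer(sample))
```

Keeping the final answer in a fixed tag makes it straightforward to score a completion automatically, which is also useful when filtering reasoning trajectories for fine-tuning.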

Section 05

Flexible Usage: Three Learning Methods for the Course

The course can be used in three ways:

GitHub Codespaces (Recommended): The full environment is configured in the browser, with support for setting API keys;
Local Run: Set up the environment with the uv package manager (requires Python 3.11 or later);
Existing Environment: Clone the repository and run the notebooks directly (requires familiarity with Jupyter and PyTorch).

Section 06

Course Value: Why Is It Worth Paying Attention To?

As reasoning models grow in importance, simply calling APIs is no longer enough. What sets this course apart is that learners build a reasoning model themselves, working through technical details such as GRPO and rejection sampling and gaining a solid understanding of how reasoning models actually work. It suits developers, researchers, and technical decision-makers who want to understand the principles behind models like o1 and DeepSeek R1.
