Panoramic View of Large Model Post-Training Technologies: The Evolution from Online SFT to Reasoning Models

An in-depth analysis of the Awesome-On-Policy-Post-Training-for-LLMs repository, systematically organizing the core methodologies in the post-training phase of large language models, including key technical paths such as online supervised fine-tuning, distillation, and reinforcement learning.

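Tags: large language models, post-training, online supervised fine-tuning, distillation, reinforcement learning, RLHF, reasoning models, DeepSeek-R1, self-improvement, GitHub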

Published 2026-06-15 02:43 | Recent activity 2026-06-15 02:48 | Estimated read 7 min

Section 01

[Introduction] Panoramic View of Large Model Post-Training Technologies: Core Methodologies and Repository Analysis

This article analyzes the Awesome-On-Policy-Post-Training-for-LLMs repository and organizes the core methodologies of large language model post-training, including online supervised fine-tuning, distillation, and reinforcement learning, tracing the evolution from online SFT to reasoning models. Post-training largely determines whether a model can solve complex tasks and reason well. The repository focuses on on-policy methods and offers researchers and practitioners a technical map of the field.

Section 02

Background: Large Model Training Phases and Repository Source

Large language model training is commonly divided into two phases: pre-training, which builds general language knowledge, and post-training, which shapes the model's ability on complex tasks. The repository is maintained by Masoud Jafaripour and was published on GitHub (https://github.com/Masoudjafaripour/Awesome-On-Policy-Post-Training-for-LLMs) on June 14, 2026. It focuses on on-policy methods, in which training data is generated by the current model's own policy and refreshed as that policy improves.

Section 03

Method (1): Online Supervised Fine-Tuning and Distillation Technologies

Online Supervised Fine-Tuning (Online SFT): The model continuously collects its own generated trajectories and trains on them with a supervised objective, which reduces the dependence on manual annotation. Representative works include Self-Instruct (2022) and ReST (2023).
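The generate, filter, and fine-tune cycle described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyModel`, `online_sft_round`, and the exact-match filter `is_good` are hypothetical names, the "model" is a table of answer weights rather than an LLM, and real systems such as ReST use a learned reward model or other filter instead of reference answers.

```python
import random


class ToyModel:
    """Hypothetical stand-in for an LLM: per-prompt weights over candidate answers."""

    def __init__(self, candidates, seed=0):
        self.weights = {p: {a: 1.0 for a in answers} for p, answers in candidates.items()}
        self.rng = random.Random(seed)

    def generate(self, prompt):
        # Sample an answer in proportion to the current weights (the "policy").
        answers = list(self.weights[prompt])
        weights = [self.weights[prompt][a] for a in answers]
        return self.rng.choices(answers, weights=weights, k=1)[0]

    def fine_tune(self, examples, step=1.0):
        # Toy "supervised update": make each kept answer more likely next round.
        for prompt, answer in examples:
            self.weights[prompt][answer] += step

    def prob(self, prompt, answer):
        return self.weights[prompt][answer] / sum(self.weights[prompt].values())


def online_sft_round(model, prompts, is_good, samples_per_prompt=4):
    """One round: sample from the current model, keep good outputs, train on them."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            output = model.generate(prompt)
            if is_good(prompt, output):
                kept.append((prompt, output))
    model.fine_tune(kept)
    return kept


# Assumed toy data: reference answers act as the filter.
REFERENCE = {"2+3": "5", "4*2": "8"}
CANDIDATES = {"2+3": ["5", "6", "4"], "4*2": ["8", "6", "2"]}


def run(rounds=5):
    model = ToyModel(CANDIDATES)
    for _ in range(rounds):
        online_sft_round(model, list(REFERENCE), lambda p, o: REFERENCE[p] == o)
    return model
```

Because each round samples from the updated weights, later rounds draw correct answers more often, which is the property that separates this loop from training on a fixed, pre-collected dataset.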

Distillation techniques fall into three types:

Offline distillation (teacher model trajectories are collected offline and used to train the student, e.g., Distilling Step-by-Step, DeepSeek-R1-Distill Models);
Self-distillation (the model learns from its own generated traces, e.g., STaR, Quiet-STaR, Self-Rewarding Language Models);
On-policy distillation (the current model generates the data it is then improved on, e.g., ReST-EM, DeepSeek-R1, Tree of Thoughts, RAP).

Section 04

Method (2): Reinforcement Learning and Verifier-Guided Learning

Applications of Reinforcement Learning:

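RLHF (Reinforcement Learning from Human Feedback, e.g., InstructGPT; Constitutional AI follows the same recipe but replaces much of the human feedback with AI feedback);
RLVR (Reinforcement Learning with Verifiable Rewards, e.g., DeepSeekMath, DeepSeek-R1);
Online Preference Learning (e.g., DPO, which is offline in its original form, and Online DPO).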
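To make the RLVR entry concrete, the sketch below pairs a rule-based reward with group-relative advantages in the style of GRPO, the algorithm introduced in the DeepSeekMath paper. It is a simplified illustration, not that paper's implementation: `verifiable_reward` and `group_relative_advantages` are hypothetical names, the answer extraction (last number in the text) is an assumption, and the policy-gradient update itself is omitted.

```python
import re
import statistics


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last number in the completion matches the reference string.

    Assumption: answers are compared as strings, so "12" and "12.0" differ.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == reference else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Using the
    population standard deviation here is an assumption of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for the prompt "What is 7 + 5?"
completions = ["The answer is 12", "I think 13", "so 7 + 5 = 12", "11"]
rewards = [verifiable_reward(c, "12") for c in completions]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed. When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.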

Verifier-Guided Learning: Uses process-level or outcome-level verification as the training signal, e.g., Let's Verify Step by Step, Self-Rewarding Language Models.
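The difference between outcome and process verification can be shown with a small sketch. Everything here is an assumption made for illustration: the step format (`a op b = c`), the helper names, and the choice to score a solution by its weakest step. A learned verifier would return probabilities rather than the 0/1 arithmetic check used below.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
STEP = re.compile(r"\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*")


def check_step(step: str) -> float:
    """Toy step verifier: 1.0 if the arithmetic in 'a op b = c' holds, else 0.0."""
    match = STEP.fullmatch(step)
    if not match:
        return 0.0
    a, op, b, c = match.groups()
    return 1.0 if OPS[op](int(a), int(b)) == int(c) else 0.0


def outcome_score(steps, reference) -> float:
    """Outcome verification: only the final result is compared to the reference."""
    final = steps[-1].split("=")[-1].strip()
    return 1.0 if final == str(reference) else 0.0


def process_score(steps, step_verifier=check_step) -> float:
    """Process verification: a solution is only as good as its weakest step."""
    return min(step_verifier(s) for s in steps)


def best_of_n(candidates, scorer):
    """Pick the candidate solution with the highest verifier score."""
    return max(candidates, key=scorer)


lucky = ["3 + 4 = 6", "6 * 2 = 14"]   # wrong steps, right final number
sound = ["3 + 4 = 7", "7 * 2 = 14"]   # every step checks out
chosen = best_of_n([lucky, sound], process_score)
```

Both candidates pass the outcome check against `14`, but only `sound` passes the process check, so a process-level verifier filters out solutions that reach the right answer through invalid reasoning.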

Section 05

Method (3): Search-Based and Self-Improvement Technologies

Search-Based Learning: Uses search to generate high-quality reasoning trajectories, e.g., Tree of Thoughts (tree search), RAP (Monte Carlo Tree Search), VReST (search combined with a verifier), and Socratic-MCTS (Socratic questioning combined with MCTS).
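A minimal sketch of search over intermediate steps is given below, using breadth-first expansion with a beam, loosely in the spirit of Tree of Thoughts. It is not the published algorithm: in the real method an LLM proposes and evaluates thoughts, whereas here `expand` and `score` are hypothetical arithmetic stubs. Monte Carlo Tree Search, as used by RAP, is not shown.

```python
def beam_search(start, expand, score, is_goal, beam_width=2, max_depth=5):
    """Keep the `beam_width` best partial paths at each depth; return the first goal path."""
    frontier = [[start]]
    for _ in range(max_depth):
        candidates = []
        for path in frontier:
            for nxt in expand(path[-1]):
                new_path = path + [nxt]
                if is_goal(nxt):
                    return new_path
                candidates.append(new_path)
        if not candidates:
            return None
        # Higher score means a more promising state.
        candidates.sort(key=lambda p: score(p[-1]), reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy task (an assumption for illustration): reach 14 from 1 using "+3" or "*2".
TARGET = 14


def expand(n):
    return [n + 3, n * 2]


def score(n):
    return -abs(TARGET - n)


def is_goal(n):
    return n == TARGET


found = beam_search(1, expand, score, is_goal, beam_width=2)
greedy = beam_search(1, expand, score, is_goal, beam_width=1)
```

With a beam of two the search returns the path `[1, 4, 7, 14]`, while the greedy setting (`beam_width=1`) commits to the locally best state at each depth and returns `None` within the depth limit. Trajectories found this way are the kind of data a model could then be trained on.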

Self-Improvement and Self-Play: The model iteratively improves using its own outputs, e.g., STaR (generate-verify-fine-tune cycle), Reflexion (self-reflection), Quiet-STaR (implicit chain of thought).
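As one illustration of self-reflection, the sketch below follows the general shape of a Reflexion-style loop: attempt, receive feedback, store a reflection in memory, and retry with that memory, without any weight update. The function names and the number-guessing task are assumptions chosen to keep the example runnable. In the actual method the reflections are natural-language notes written by the model.

```python
def reflexion_loop(attempt, evaluate, reflect, max_trials=5):
    """Retry a task, carrying forward reflections on earlier failures."""
    memory = []
    for trial in range(1, max_trials + 1):
        output = attempt(memory)
        ok, feedback = evaluate(output)
        if ok:
            return output, trial, memory
        memory.append(reflect(output, feedback))
    return None, max_trials, memory


# Toy task (an assumption for illustration): find a hidden number in 0..100.
HIDDEN = 37


def attempt(memory):
    # Use stored reflections to narrow the range before guessing again.
    low, high = 0, 100
    for guess, feedback in memory:
        if feedback == "too low":
            low = max(low, guess + 1)
        else:
            high = min(high, guess - 1)
    return (low + high) // 2


def evaluate(guess):
    if guess == HIDDEN:
        return True, ""
    return False, "too low" if guess < HIDDEN else "too high"


def reflect(guess, feedback):
    # A real system would store a verbal self-critique; here it is a tuple.
    return (guess, feedback)


answer, trials, notes = reflexion_loop(attempt, evaluate, reflect)
```

The loop succeeds on the third trial because each stored reflection constrains the next attempt. Methods such as STaR instead close the loop by fine-tuning on the filtered outputs.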

Section 06

Evidence: Cutting-Edge Reasoning Models, Evaluation Benchmarks, and Open-Source Frameworks

Cutting-Edge Reasoning Models: o1/o3 (OpenAI), DeepSeek-R1 (open weights; its R1-Zero variant showed reasoning emerging from pure RL training), QwQ (Alibaba Cloud), Kimi Reasoning Models (Moonshot AI).

Evaluation Benchmarks: GSM8K/MATH/AIME (mathematics), GPQA (graduate-level science), MMLU-Pro (multi-disciplinary), SWE-Bench (software engineering), BrowseComp (web browsing).

Open-Source Frameworks: TRL (Hugging Face), OpenRLHF, verl (ByteDance), DeepSpeed-Chat (Microsoft), Megatron-LM (NVIDIA).

Section 07

Conclusion: Technology Evolution Trends and Repository Value

The post-training technologies for large models show four major trends:

From offline to online (data is generated dynamically during training);
From outcome-focused to process-focused (attention shifts to the reasoning process, not only the final answer);
From human-dependent to automatic (automatic verifiers replace human feedback);
From single to combined (multiple techniques are used together).

This repository gives researchers a technical map and gives practitioners a path toward building reasoning models. The success of open models such as DeepSeek-R1 suggests that efficient reasoning is becoming a standard feature of large models.
