Panoramic View of Large Model Post-Training Technologies: The Evolution from Online SFT to Reasoning Models

An in-depth analysis of the Awesome-On-Policy-Post-Training-for-LLMs repository, systematically organizing the core methodologies in the post-training phase of large language models, including key technical paths such as online supervised fine-tuning, distillation, and reinforcement learning.

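Tags: large language models, post-training, online supervised fine-tuning, distillation, reinforcement learning, RLHF, reasoning models, DeepSeek-R1, self-improvement, GitHub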
Published 2026-06-15 02:43 | Recent activity 2026-06-15 02:48 | Estimated read 7 min

Section 01

[Introduction] Panoramic View of Large Model Post-Training Technologies: Core Methodologies and Repository Analysis

This article analyzes the Awesome-On-Policy-Post-Training-for-LLMs repository and organizes the core methodologies of the post-training phase of large language models, including online supervised fine-tuning, distillation, and reinforcement learning, tracing the evolution from online SFT to reasoning models. Post-training largely determines whether a model can solve complex tasks and reason through them. The repository focuses on on-policy methods, where the model learns from data produced by its own current policy, and gives researchers and practitioners a technical map of the field.

Section 02

Background: Large Model Training Phases and Repository Source

Large language model training is commonly divided into two phases: pre-training, which acquires general language knowledge, and post-training, which shapes the model's ability to handle complex tasks. The repository is maintained by Masoud Jafaripour and was published on GitHub (link: https://github.com/Masoudjafaripour/Awesome-On-Policy-Post-Training-for-LLMs) on June 14, 2026. It focuses on on-policy methods, in which training data is generated by the current model's own policy and refreshed as that policy improves, rather than drawn from a fixed dataset.

Section 03

Method (1): Online Supervised Fine-Tuning and Distillation Technologies

Online Supervised Fine-Tuning (Online SFT): The model continuously collects its own generated trajectories and uses them for supervised learning, which reduces the dependence on manual annotation. Representative works include Self-Instruct (2022) and ReST (2023).
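
The collect-then-train loop can be sketched roughly as follows, loosely following the Grow and Improve steps described for ReST. This is a minimal illustration, not the repository's or any paper's implementation: `toy_policy` and `reward` are hypothetical stand-ins for sampling from the current model and for a reward signal, and the fine-tuning step itself is left out.

```python
import random

def toy_policy(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for sampling a completion from the current model."""
    a, b = map(int, prompt.split("+"))
    # The toy policy is usually right and sometimes off by one.
    return str(a + b + rng.choice([0, 0, 0, 1]))

def reward(prompt: str, completion: str) -> float:
    """Hypothetical reward: 1.0 for a correct sum, 0.0 otherwise."""
    a, b = map(int, prompt.split("+"))
    return 1.0 if completion == str(a + b) else 0.0

def grow(prompts, samples_per_prompt, rng):
    """Grow step: sample several completions per prompt from the current policy."""
    return [(p, toy_policy(p, rng)) for p in prompts for _ in range(samples_per_prompt)]

def improve(dataset, threshold):
    """Improve step: keep only samples whose reward reaches the threshold."""
    return [(p, c) for p, c in dataset if reward(p, c) >= threshold]

rng = random.Random(0)
prompts = ["1+2", "3+4", "10+5"]
generated = grow(prompts, samples_per_prompt=4, rng=rng)
sft_data = improve(generated, threshold=1.0)
# sft_data would now feed an ordinary supervised fine-tuning step,
# after which the loop repeats with the updated model.
print(len(generated), len(sft_data))
```

The filtered set is never larger than the generated set, and the share that survives depends on how good the current policy already is.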

Distillation techniques fall into three types:

  1. Offline distillation (teacher-model trajectories are collected offline and used to train the student, e.g., Distilling Step-by-Step, DeepSeek-R1-Distill models);
  2. Self-distillation (the model learns from its own generated traces, e.g., STaR, Quiet-STaR, Self-Rewarding Language Models);
  3. On-policy distillation (the current model generates the data it is then improved on, e.g., ReST-EM, DeepSeek-R1, Tree of Thoughts, RAP).
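
One formulation sometimes used when a separate teacher is available scores the student's own sampled tokens under the teacher and minimizes a per-token reverse KL divergence. The sketch below assumes that formulation, which the summary above does not itself specify, and uses made-up logits over a three-token vocabulary.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one next-token distribution."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits at each position of a sequence sampled from the student.
student_positions = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
teacher_positions = [[2.2, 0.4, 0.0], [0.1, 0.2, 2.0]]

per_token = [reverse_kl(s, t) for s, t in zip(student_positions, teacher_positions)]
loss = sum(per_token) / len(per_token)  # quantity the student would minimize
print(per_token, loss)
```

The second position contributes more to the loss because the student and teacher disagree there about the most likely token.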

Section 04

Method (2): Reinforcement Learning and Validator-Guided Learning

Applications of Reinforcement Learning:

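  1. RLHF (Reinforcement Learning from Human Feedback, e.g., InstructGPT; Constitutional AI follows a similar recipe but draws part of its preference signal from AI feedback);
  2. RLVR (Reinforcement Learning with Verifiable Rewards, e.g., DeepSeekMath, DeepSeek-R1);
  3. Online Preference Learning (e.g., Online DPO, which applies the DPO objective, originally an offline method, to preference pairs sampled from the current policy).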
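
To make the RLVR entry concrete, here is a minimal sketch of a verifiable reward combined with group-relative advantages in the spirit of GRPO, the algorithm introduced with DeepSeekMath. The exact-match reward, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration; the policy-gradient update is omitted.

```python
import statistics

def verifiable_reward(completion: str, reference: str) -> float:
    """Rule-based reward: 1.0 if the final answer matches the reference exactly."""
    return 1.0 if completion.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical completions sampled from the current policy for one prompt.
completions = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Correct samples receive positive advantages and incorrect ones negative advantages, so no learned reward model or value network is needed for this step.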

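Validator-Guided Learning (more commonly called verifier-guided learning): Guides learning by checking either the intermediate reasoning steps (process verification) or only the final answer (outcome verification), e.g., Let's Verify Step by Step, Self-Rewarding Language Models.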
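
A common way to use a process verifier is best-of-N selection: score every step of each candidate solution and keep the candidate with the highest overall score. The sketch below aggregates step scores by taking their product, as Let's Verify Step by Step does for its process reward model; the candidate answers and step probabilities are invented for illustration.

```python
import math

def process_score(step_probs):
    """Probability that every step is correct, assuming independent step scores."""
    return math.prod(step_probs)

def best_of_n(candidates):
    """Return the candidate whose reasoning steps the verifier trusts most."""
    return max(candidates, key=lambda c: process_score(c["step_probs"]))

# Hypothetical verifier outputs for three sampled solutions to one problem.
candidates = [
    {"answer": "18", "step_probs": [0.9, 0.9, 0.4]},
    {"answer": "20", "step_probs": [0.8, 0.8, 0.8]},
    {"answer": "18", "step_probs": [0.99, 0.5, 0.5]},
]
chosen = best_of_n(candidates)
print(chosen["answer"])
```

Because the scores multiply, a single weak step pulls a solution down even when its other steps look strong, which is what distinguishes process verification from checking the final answer alone.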

Section 05

Method (3): Search-Based and Self-Improvement Technologies

Search-Based Learning: Uses search to generate high-quality reasoning trajectories, e.g., Tree of Thoughts (tree search), RAP (Monte Carlo Tree Search), VReST (search combined with a validator), Socratic-MCTS (Socratic questioning combined with MCTS).
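
The general idea of searching over partial reasoning states can be shown with a toy beam search, which is a simpler relative of the tree search and MCTS methods named above rather than a reimplementation of any of them. The `expand` and `evaluate` functions are hypothetical: in a real system an LLM would propose next steps and a value model or verifier would score them.

```python
def expand(state):
    """Propose next 'thoughts': here, two arithmetic moves on a number."""
    value, path = state
    return [(value + 3, path + ["+3"]), (value * 2, path + ["*2"])]

def evaluate(state, target):
    """Heuristic value of a partial solution: closer to the target is better."""
    return -abs(target - state[0])

def beam_search(start, target, beam_width=2, max_depth=5):
    """Keep only the best few partial solutions at each depth."""
    frontier = [(start, [])]
    for _ in range(max_depth):
        for state in frontier:
            if state[0] == target:
                return state
        candidates = [child for s in frontier for child in expand(s)]
        candidates.sort(key=lambda s: evaluate(s, target), reverse=True)
        frontier = candidates[:beam_width]
    # No exact hit within the depth limit: return the closest state found.
    return max(frontier, key=lambda s: evaluate(s, target))

value, path = beam_search(start=1, target=14)
print(value, path)
```

The returned path is a complete trajectory that reaches the target, which is the kind of search-derived sample that could then serve as training data.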

Self-Improvement and Self-Play: The model improves iteratively by learning from its own outputs, e.g., STaR (generate-verify-fine-tune cycle), Reflexion (self-reflection), Quiet-STaR (implicit chain of thought).
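
A retry loop in the style of Reflexion can be sketched as attempt, evaluate, record a reflection, and try again with the reflections in context. The toy below replaces the LLM with a number-guessing actor, so `attempt`, `evaluate`, and the task format are all hypothetical; it only illustrates how stored feedback shapes the next attempt.

```python
def attempt(task, reflections):
    """Hypothetical actor: a real system would prompt an LLM with the task and past reflections."""
    low, high = task["low"], task["high"]
    for kind, guess in reflections:
        if kind == "too low":
            low = max(low, guess + 1)
        else:
            high = min(high, guess - 1)
    return (low + high) // 2

def evaluate(task, guess):
    """Hypothetical evaluator: returns None on success, otherwise a short reflection."""
    if guess == task["secret"]:
        return None
    return ("too low" if guess < task["secret"] else "too high", guess)

def solve_with_reflection(task, max_trials=10):
    """Retry until the evaluator accepts, carrying reflections across trials."""
    reflections = []
    for _ in range(max_trials):
        guess = attempt(task, reflections)
        feedback = evaluate(task, guess)
        if feedback is None:
            return guess, reflections
        reflections.append(feedback)
    return None, reflections

task = {"low": 1, "high": 100, "secret": 37}
answer, reflections = solve_with_reflection(task)
print(answer, reflections)
```

No weights are updated in this loop; the improvement comes entirely from the accumulated reflections, which is the main contrast with the fine-tuning cycle used by STaR.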

Section 06

Evidence: Cutting-Edge Reasoning Models, Evaluation Benchmarks, and Open-Source Frameworks

Cutting-Edge Reasoning Models: o1/o3 (OpenAI), DeepSeek-R1 (open weights; its R1-Zero variant showed reasoning behavior emerging from pure RL training), QwQ (Alibaba Cloud), Kimi reasoning models (Moonshot AI).

Evaluation Benchmarks: GSM8K/MATH/AIME (mathematics), GPQA (graduate-level science), MMLU-Pro (multi-disciplinary), SWE-Bench (software engineering), BrowseComp (web browsing).
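
Math benchmarks such as GSM8K are typically scored by comparing the final number in the model's output with the reference, whose solutions end in a line of the form `#### <answer>`. The following is a simplified sketch of such exact-match scoring; the number-extraction rule and the sample strings are assumptions, and official evaluation harnesses apply more careful normalization.

```python
import re

def extract_final_number(text):
    """Return the last number in the text with thousands separators removed, or None."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's final number."""
    correct = 0
    for pred, ref in zip(predictions, references):
        p = extract_final_number(pred)
        if p is not None and p == extract_final_number(ref):
            correct += 1
    return correct / len(references)

# Hypothetical model outputs and GSM8K-style references.
predictions = ["So the answer is 72.", "She has 1,000 apples", "I think it is 5"]
references = ["48 + 24 = 72\n#### 72", "#### 1000", "#### 6"]
accuracy = exact_match_accuracy(predictions, references)
print(accuracy)
```

String-level comparison like this treats `72` and `72.0` as different answers, one reason real harnesses normalize numbers before comparing.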

Open-Source Frameworks: TRL (Hugging Face), OpenRLHF, verl (ByteDance), DeepSpeed-Chat (Microsoft), Megatron-LM (NVIDIA).

Section 07

Conclusion: Technology Evolution Trends and Repository Value

The post-training technologies for large models show four major trends:

  1. From offline to online (dynamically generating data);
  2. From result-focused to process-focused (paying attention to reasoning processes);
  3. From human-dependent to automatic (automatic validators replacing human feedback);
  4. From single to combined (using multiple technologies together).

This repository gives researchers a technical map and shows practitioners a path toward building reasoning models. The success of open-source models such as DeepSeek-R1 suggests that efficient reasoning is likely to become a standard feature of large models.