On-Policy Post-Training Resource Library for Large Language Models: A Complete Technical Map from SFT to RLHF

This article introduces an open-source resource library that systematically organizes On-Policy post-training techniques for large language models, covering core methods such as online SFT, policy distillation, RLHF, RLVR, self-improvement, validator-guided learning, and search-based training, providing a one-stop learning path for researchers and engineers.

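Tags: large language models, On-Policy, post-training, RLHF, online SFT, policy distillation, reinforcement learning, reward model, self-improvement, validator-guided learning, open-source resources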
Published 2026-06-15 02:43 | Recent activity 2026-06-15 02:53 | Estimated read 8 min

Section 01

Introduction

This article introduces an open-source resource library, maintained by Masoud Jafaripour, that systematically organizes On-Policy post-training techniques for large language models. It covers core methods including online SFT, policy distillation, RLHF, RLVR, self-improvement, validator-guided learning, and search-based training, giving researchers and engineers a one-stop learning path. The library is hosted on GitHub (link: https://github.com/Masoudjafaripour/Awesome-On-Policy-Post-Training-for-LLMs) under the MIT license, was released on June 14, 2026, and is updated continuously to track the latest developments in the field.

Section 02

Background: Why On-Policy Post-Training Matters So Much

The development of large language models (LLMs) has shifted in emphasis from pre-training to post-training. Pre-training gives a model general language capabilities, but post-training is what makes it practically useful and aligns it with human preferences. Traditional supervised fine-tuning (SFT) relies on static, manually annotated data, which makes it hard to capture the nuances of human preferences and gives the model no way to keep learning through interaction. On-Policy post-training methods instead train on outputs sampled from the model's current policy: the model generates responses, receives feedback from an environment, a verifier, or humans, and is updated (often with policy gradient optimization) so that its outputs better match the desired behavior.

Section 03

Overview of the Resource Library: One-Stop Technical Navigation

This open-source resource library collects core papers, open-source code, review articles, and benchmarks in the field of On-Policy post-training, classifies them by technical method, and lays out a clear learning path. The MIT license allows free use, modification, and distribution, so the community can build on it.

Section 04

Analysis of Core Technical Methods

1. Online Supervised Fine-Tuning (Online SFT)

Traditional SFT trains on a static dataset, whereas online SFT has the model generate samples dynamically and filters them before training. The advantage is that the model explores a wider output space and keeps only high-quality samples, selected through self-assessment or external feedback. The main challenge is avoiding a feedback loop in which the model reinforces its own error patterns.
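
The generate-then-filter loop described above can be sketched roughly as follows. This is a minimal illustration, not code from the resource library: `online_sft_round`, `toy_generate`, and `toy_score` are hypothetical names, the "policy" is a toy arithmetic guesser, and the actual fine-tuning step on the kept pairs is omitted.

```python
import random
from typing import Callable, List, Tuple


def online_sft_round(
    prompts: List[str],
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    samples_per_prompt: int = 4,
    min_score: float = 0.5,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample from the current policy and keep the best sample per prompt,
    but only if it clears the quality threshold."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt, rng) for _ in range(samples_per_prompt)]
        best = max(candidates, key=lambda c: score(prompt, c))
        if score(prompt, best) >= min_score:
            kept.append((prompt, best))
    return kept


# Hypothetical stand-ins: a noisy "policy" for sums and an exact-match scorer.
def toy_generate(prompt: str, rng: random.Random) -> str:
    a, b = (int(x) for x in prompt.split("+"))
    return str(a + b + rng.choice([-1, 0, 0, 1]))


def toy_score(prompt: str, completion: str) -> float:
    a, b = (int(x) for x in prompt.split("+"))
    return 1.0 if int(completion) == a + b else 0.0


# The kept (prompt, completion) pairs would form the next SFT batch.
batch = online_sft_round(["2+3", "10+7"], toy_generate, toy_score)
```

Dropping prompts whose best sample fails the threshold, instead of training on the least-bad sample, is one simple guard against the error-reinforcement loop mentioned above.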

2. On-Policy Policy Distillation

The student model samples its own outputs and learns the target behavior from the teacher model's guidance on those samples. Compared with offline distillation, this approach is reported to have higher sample efficiency and better generalization.
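
As a rough sketch of what "the student samples, the teacher guides" can mean in practice, the code below scores a student-sampled sequence by the per-token reverse KL divergence between the student's and the teacher's next-token distributions. Using reverse KL is an assumption made for illustration (other divergences are possible), and the models are hypothetical callables that return a token-to-probability dict, not a real library API.

```python
import math
import random
from typing import Callable, Dict, Tuple

Dist = Dict[str, float]
Model = Callable[[Tuple[str, ...]], Dist]


def sample_token(dist: Dist, rng: random.Random) -> str:
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def reverse_kl(student: Dist, teacher: Dist, eps: float = 1e-12) -> float:
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * math.log(p / max(teacher.get(tok, 0.0), eps))
        for tok, p in student.items()
        if p > 0.0
    )


def on_policy_distill_loss(
    student: Model,
    teacher: Model,
    prompt: Tuple[str, ...],
    max_len: int = 5,
    seed: int = 0,
) -> float:
    """Average per-token reverse KL along a sequence sampled from the student."""
    rng = random.Random(seed)
    prefix = tuple(prompt)
    total = 0.0
    for _ in range(max_len):
        s_dist = student(prefix)
        total += reverse_kl(s_dist, teacher(prefix))
        # On-policy: the next prefix comes from the student, not from a fixed dataset.
        prefix += (sample_token(s_dist, rng),)
    return total / max_len


# Toy models that ignore the prefix (assumption, for illustration only).
def toy_teacher(prefix: Tuple[str, ...]) -> Dist:
    return {"a": 0.9, "b": 0.1}


def toy_student(prefix: Tuple[str, ...]) -> Dist:
    return {"a": 0.5, "b": 0.5}


loss = on_policy_distill_loss(toy_student, toy_teacher, ("x",))
```

In a real setup the loss would be minimized by gradient descent on the student's parameters; here it is only evaluated.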

3. RLHF: Reinforcement Learning from Human Feedback

RLHF trains a reward model to capture human preferences and then uses PPO (Proximal Policy Optimization) to optimize the LLM's outputs against that reward, which turns subjective judgments into an optimizable objective. Its challenges include over-optimization of the reward model, annotation bias, and high computational cost.
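
The two objectives involved can be written down compactly. The sketch below assumes the commonly used Bradley-Terry pairwise loss for the reward model, which the article does not specify, together with the standard PPO clipped surrogate for a single sample. The function names are illustrative, and practical details such as a KL penalty against a reference model and advantage estimation are left out.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)),
    computed in a numerically stable softplus form."""
    diff = r_chosen - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def ppo_clipped_objective(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_eps: float = 0.2,
) -> float:
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# The loss is small when the reward model ranks the preferred response higher.
low = preference_loss(2.0, 0.0)
high = preference_loss(0.0, 2.0)

# With a positive advantage, gains from a large policy ratio are capped at 1 + clip_eps.
capped = ppo_clipped_objective(logp_new=0.0, logp_old=-1.0, advantage=1.0)
```

The clipping is what keeps each policy update close to the policy that generated the samples, which is one practical defense against the reward over-optimization noted above.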

4. RLVR: Reinforcement Learning with Verifiable Rewards

RLVR uses automatically verifiable reward signals, such as code execution results or mathematical answers, and performs well on reasoning tasks. Models like DeepSeek-R1 have demonstrated its potential to improve reasoning capabilities.
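
A verifiable reward can be as simple as checking a final answer against a reference. The sketch below assumes a hypothetical output convention in which the completion ends with `Answer: <number>`. It also shows one common way to turn such rewards into a training signal, normalizing them within a group of samples for the same prompt. That normalization step is an assumption added for illustration and is not described in the article.

```python
import re
import statistics
from typing import List


def math_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final 'Answer: <number>' matches the reference, else 0.0."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
    if match is None:
        return 0.0  # Unparseable output earns no reward.
    return 1.0 if float(match.group(1)) == float(reference) else 0.0


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within a group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # All samples equally good: no learning signal.
    return [(r - mean) / std for r in rewards]


completions = [
    "2 + 2 = 4. Answer: 4",
    "I think it is five. Answer: 5",
    "Not sure.",
    "Adding gives 4. Answer: 4.0",
]
rewards = [math_reward(c, "4") for c in completions]
advantages = group_advantages(rewards)
```

Because the reward comes from a deterministic check instead of a learned model, there is no reward model to over-optimize, although the policy can still exploit weaknesses in the checker itself.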

5. Self-Improvement and Validator-Guided Learning

Self-improvement lets a model refine itself iteratively without external supervision, while validator-guided learning adds an independent verification component that catches and corrects errors. Combining the two is particularly effective on complex reasoning tasks.
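
One way to picture the combination is a propose, verify, revise loop in which the verifier's feedback steers the next attempt. The example below is a toy under stated assumptions: `refine_with_verifier`, `make_proposer`, and `verify_sqrt_144` are hypothetical names, and the "model" is a simple bisection guesser standing in for an LLM.

```python
from typing import Callable, List, Optional, Tuple


def refine_with_verifier(
    propose: Callable[[Optional[str]], int],
    verify: Callable[[int], Tuple[bool, str]],
    max_rounds: int = 5,
) -> List[Tuple[int, bool]]:
    """Alternate proposals and verification until an attempt is accepted
    or the round budget runs out. Returns (attempt, accepted) pairs."""
    feedback: Optional[str] = None
    history = []
    for _ in range(max_rounds):
        attempt = propose(feedback)
        ok, feedback = verify(attempt)
        history.append((attempt, ok))
        if ok:
            break
    return history


def make_proposer(lo: int, hi: int) -> Callable[[Optional[str]], int]:
    """Toy 'model' that revises its guess by bisection based on verifier feedback."""
    state = {"lo": lo, "hi": hi, "last": lo}

    def propose(feedback: Optional[str]) -> int:
        if feedback == "too low":
            state["lo"] = state["last"] + 1
        elif feedback == "too high":
            state["hi"] = state["last"] - 1
        state["last"] = (state["lo"] + state["hi"]) // 2
        return state["last"]

    return propose


def verify_sqrt_144(x: int) -> Tuple[bool, str]:
    """Independent checker: accepts x only if x * x == 144."""
    if x * x == 144:
        return True, ""
    return False, ("too low" if x * x < 144 else "too high")


history = refine_with_verifier(make_proposer(0, 20), verify_sqrt_144)
```

In a training setting, the accepted attempts (and possibly the rejected ones with their feedback) could be collected as data for the next round of fine-tuning.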

6. Search-Based Training Methods

These methods combine training with search algorithms (e.g., MCTS, beam search) to explore better reasoning paths, which improves accuracy and interpretability.
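
To make the search side concrete, here is a minimal beam search over "reasoning steps" on a toy problem: reach a target number from 1 using the steps `+1` and `*2`. The state representation, `expand`, and `score` are assumptions made for illustration. In a real system the scorer would typically be a learned value or reward model, and the paths found would serve as training targets.

```python
from typing import Callable, List, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, steps taken so far)


def beam_search(
    start: State,
    expand: Callable[[State], List[State]],
    score: Callable[[State], float],
    beam_width: int = 2,
    depth: int = 5,
) -> State:
    """Keep the top `beam_width` partial paths at each depth and return the best one."""
    beam = [start]
    for _ in range(depth):
        candidates = [nxt for state in beam for nxt in expand(state)]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)


TARGET = 10


def expand(state: State) -> List[State]:
    value, path = state
    return [(value + 1, path + ("+1",)), (value * 2, path + ("*2",))]


def score(state: State) -> float:
    # Closer to the target is better. This stands in for a learned step scorer.
    return -abs(TARGET - state[0])


best_value, best_path = beam_search((1, ()), expand, score)
```

The returned path is an explicit, inspectable sequence of steps, which is the sense in which search can help interpretability. Beam search is greedy at each depth, so the path it finds is not guaranteed to be the shortest.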

Section 05

Practical Significance and Application Prospects

The resource library gives AI researchers and engineers end-to-end support from theory to practice. Whether you want to understand the principles of RLHF or find code implementations, you can find references here. On-Policy post-training is reshaping what LLMs can do: advanced models such as ChatGPT, Claude, Llama, and DeepSeek all make extensive use of these methods. Mastering them helps in building more capable, reliable, and human-aligned AI systems.

Section 06

Conclusion and Learning Recommendations

The field of On-Policy post-training is developing rapidly, and new algorithms appear constantly. The value of this resource library lies in its continuous updates, which help researchers keep up with the latest progress. Recommended learning path: build an overall understanding from review articles, then read the original papers on specific methods, and finally verify your understanding by working with the open-source code, following a 'review, paper, code' progression.