# On-Policy Post-Training Resource Library for Large Language Models: A Complete Technical Map from SFT to RLHF

> This article introduces an open-source resource library that systematically organizes On-Policy post-training techniques for large language models, covering core methods such as online SFT, policy distillation, RLHF, RLVR, self-improvement, validator-guided learning, and search-based training, providing a one-stop learning path for researchers and engineers.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-14T18:43:21.000Z
- Last activity: 2026-06-14T18:53:01.864Z
- Popularity: 145.8
- Keywords: large language models, On-Policy post-training, RLHF, online SFT, policy distillation, reinforcement learning, reward model, self-improvement, validator-guided learning, open-source resources
- Page link: https://www.zingnex.cn/en/forum/thread/on-policy-sftrlhf
- Canonical: https://www.zingnex.cn/forum/thread/on-policy-sftrlhf
- Markdown source: floors_fallback

---

## On-Policy Post-Training Resource Library for Large Language Models: A Complete Technical Map from SFT to RLHF (Introduction)

This article introduces an open-source resource library, maintained by Masoud Jafaripour, that systematically organizes On-Policy post-training techniques for large language models. It covers core methods including online SFT, policy distillation, RLHF, RLVR, self-improvement, validator-guided learning, and search-based training, giving researchers and engineers a one-stop learning path. The library is hosted on GitHub (https://github.com/Masoudjafaripour/Awesome-On-Policy-Post-Training-for-LLMs) under the MIT license. It was released on June 14, 2026, and is updated continuously to track new developments in the field.

## Background: Why On-Policy Post-Training Matters So Much

The development of large language models (LLMs) has shifted its emphasis from pre-training to post-training. Pre-training equips a model with general language capabilities, but post-training is what makes it practically useful and aligns it with human preferences. Traditional supervised fine-tuning (SFT) relies on static, manually annotated data, which makes it hard to capture the nuances of human preferences and gives the model no way to learn from its own outputs. On-Policy post-training methods instead train on outputs sampled from the model's current policy: the model generates responses, receives feedback from an environment, a reward model, or human raters, and is updated (often with policy-gradient optimization) so that its outputs better match the desired behavior.
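To make "on-policy" and "policy gradient" concrete, here is a minimal sketch of REINFORCE on a toy multi-armed bandit. It is an illustration only, not code from the resource library: the function names, the learning rate, and the bandit setup are all assumptions chosen for brevity.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=200, lr=0.5, seed=0):
    """Train a softmax policy over bandit arms with plain REINFORCE.

    `rewards[i]` is the (deterministic, toy) reward of arm i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: the action is sampled from the *current* policy,
        # not read from a fixed dataset.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rewards[action]
        # d/d logits of log pi(action) is one_hot(action) - probs.
        for i in range(len(logits)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_pi
    return softmax(logits)


final_probs = reinforce_bandit([0.0, 1.0])
```

Because every update uses a sample drawn from the policy being trained, the data distribution shifts as the policy improves. That is the property the methods in this article share.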

## Overview of the Resource Library: One-Stop Technical Navigation

This open-source resource library collects core papers, open-source code, survey articles, and benchmarks in the field of On-Policy post-training, classifies them by technical method, and lays out a clear learning path. It is released under the MIT license, which permits free use, modification, and distribution.

## Analysis of Core Technical Methods

### 1. Online Supervised Fine-Tuning (Online SFT)
Traditional SFT trains on a static dataset, whereas online SFT has the model generate samples dynamically and then filters them before training. This lets the model explore a wider output space and keep only high-quality samples, selected through self-assessment or external feedback. The main risk is a feedback loop in which the model's own errors are reinforced.
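One round of this generate-then-filter loop can be sketched as rejection sampling. This is a minimal sketch under stated assumptions: `sample_fn` and `score_fn` are hypothetical stand-ins for the current policy and the feedback source, and the threshold and sample count are arbitrary.

```python
import random
from typing import Callable, List, Tuple


def online_sft_round(
    prompts: List[str],
    sample_fn: Callable[[str, random.Random], str],  # hypothetical: one response from the current policy
    score_fn: Callable[[str, str], float],           # hypothetical: self-assessment or external feedback
    samples_per_prompt: int = 4,
    min_score: float = 0.5,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (prompt, response) pairs to use as the next SFT batch."""
    rng = random.Random(seed)
    batch = []
    for prompt in prompts:
        candidates = [sample_fn(prompt, rng) for _ in range(samples_per_prompt)]
        scored = [(score_fn(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        # Filtering step: skip the prompt if even its best sample is weak,
        # so low-quality outputs are not fed back into training.
        if best_score >= min_score:
            batch.append((prompt, best))
    return batch


# Toy stand-ins: the "policy" guesses a sum, the scorer checks it exactly.
def toy_sample(prompt: str, rng: random.Random) -> str:
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([-1, 0, 1]))


def toy_score(prompt: str, response: str) -> float:
    a, b = map(int, prompt.split("+"))
    return 1.0 if int(response) == a + b else 0.0


batch = online_sft_round(["2+3", "10+7"], toy_sample, toy_score)
```

The `min_score` threshold is the guard against the error-reinforcement loop: a weak scorer or a low threshold lets mistakes through, and the next round then trains on them.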

### 2. On-Policy Policy Distillation
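Here the student model samples its own outputs, and the teacher model provides guidance on those samples, so the student is trained on the distribution it will actually produce at inference time. Compared with offline distillation on a fixed set of teacher outputs, this approach is reported to be more sample-efficient and to generalize better.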
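A common way to express the teacher's guidance is a per-token divergence between the student and teacher next-token distributions, evaluated on a sequence the student sampled. The sketch below uses the reverse KL divergence, which is one possible choice and an assumption here, not something the resource library prescribes. Distributions are plain Python lists, and the teacher is assumed to assign non-zero probability wherever the student does.

```python
import math
from typing import List


def reverse_kl(student_probs: List[float], teacher_probs: List[float]) -> float:
    """KL(student || teacher) for a single next-token distribution."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0  # terms with zero student probability contribute nothing
    )


def on_policy_distill_loss(
    student_steps: List[List[float]],
    teacher_steps: List[List[float]],
) -> float:
    """Average reverse KL over the positions of a student-sampled sequence."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Two positions, vocabulary of size 2 (toy numbers).
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.5, 0.5]]
loss = on_policy_distill_loss(student, teacher)
```

The first position contributes zero because the two distributions agree. The second is penalized because the student is far more confident than the teacher.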

### 3. RLHF: Learning from Human Feedback
RLHF trains a reward model to capture human preferences and then optimizes the LLM's outputs against that reward model, typically with PPO (Proximal Policy Optimization). This converts subjective human judgments into an objective that can be optimized. The approach still faces challenges, including over-optimization against the reward model, annotation bias, and high computational cost.
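The two pieces described above can be written down in a few lines each. The sketch below shows a Bradley-Terry pairwise loss, a common choice for training reward models on preference pairs, and the PPO clipped surrogate objective for a single action. It works on scalar inputs for readability; function names and the default `clip_eps` are illustrative assumptions, not an API from any particular library.

```python
import math


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ppo_clipped_objective(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_eps: float = 0.2,
) -> float:
    """PPO clipped surrogate for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes the incentive to move the policy
    # far from the one that generated the sample.
    return min(ratio * advantage, clipped_ratio * advantage)
```

The reward-model loss falls as the chosen response is scored further above the rejected one. The clipping in the PPO objective limits how much a single batch can change the policy, which is one practical defense against over-optimizing the reward model.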

### 4. RLVR: Reinforcement Learning with Verifiable Rewards
RLVR replaces the learned reward model with reward signals that can be checked automatically, such as code execution results or the correctness of a mathematical answer. It performs well on reasoning tasks, and models such as DeepSeek-R1 have demonstrated its potential to improve reasoning capabilities.
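A verifiable reward can be as simple as a string comparison against a known answer. The sketch below pairs such a checker with group-normalized advantages in the style of GRPO, where several responses to the same prompt are scored and each is compared to the group mean. The `Answer:` marker convention and the function names are assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import List


def math_reward(response: str, reference: str) -> float:
    """1.0 if the text after the last 'Answer:' marker equals the reference, else 0.0."""
    marker = "Answer:"
    if marker not in response:
        return 0.0
    final = response.rsplit(marker, 1)[1].strip()
    return 1.0 if final == reference.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within a group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


responses = [
    "3 * 4 = 12, plus 2. Answer: 14",
    "3 * 4 = 11, plus 2. Answer: 13",
    "I am not sure.",
    "Twelve plus two. Answer: 14",
]
rewards = [math_reward(r, "14") for r in responses]
advantages = group_advantages(rewards)
```

Responses that beat the group average get a positive advantage and are reinforced, and the rest are pushed down. When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.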

### 5. Self-Improvement and Validator-Guided Learning
Self-improvement lets a model refine its outputs iteratively without external supervision, while validator-guided learning adds an independent verification component that detects and helps correct errors. Combining the two is particularly effective on complex reasoning tasks.
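The interaction between the two ideas can be sketched as a propose-verify-revise loop. Everything here is hypothetical: `propose` stands in for the model, `verify` for the independent validator, and the toy task (finding an integer square root) only exists to make the loop runnable.

```python
from typing import Callable, Optional, Tuple


def refine_with_verifier(
    task: str,
    propose: Callable[[str, Optional[str]], str],    # hypothetical: model answer, optionally using feedback
    verify: Callable[[str, str], Tuple[bool, str]],  # hypothetical: validator returns (ok, feedback)
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (accepted answer or None, number of rounds used)."""
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        answer = propose(task, feedback)
        ok, feedback = verify(task, answer)
        if ok:
            return answer, round_idx
    return None, max_rounds


# Toy task: find x such that x * x equals the number in `task`.
def toy_propose(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "5"  # arbitrary first guess
    direction, previous = feedback.split()
    step = 1 if direction == "low" else -1
    return str(int(previous) + step)


def toy_verify(task: str, answer: str) -> Tuple[bool, str]:
    target = int(task)
    squared = int(answer) ** 2
    if squared == target:
        return True, "ok"
    return False, ("low " if squared < target else "high ") + answer


result = refine_with_verifier("49", toy_propose, toy_verify, max_rounds=5)
```

In a training setup, answers accepted by the validator would typically be collected as new training data, so the validator's accuracy bounds the quality of what the model learns from itself.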

### 6. Search-Based Training Methods
These methods combine training with search algorithms such as MCTS (Monte Carlo Tree Search) or beam search to explore better reasoning paths, which can improve both accuracy and interpretability.
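As a minimal sketch of the search side, the beam search below explores multi-step "reasoning paths" and keeps the highest-scoring partial paths at each depth. The `expand` and `score` callables are hypothetical placeholders for a model proposing next steps and a value function or verifier rating them. The arithmetic puzzle is a toy stand-in for a reasoning task.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]


def beam_search(
    start: Path,
    expand: Callable[[Path], List[str]],  # hypothetical: candidate next steps for a path
    score: Callable[[Path], float],       # hypothetical: higher means more promising
    beam_width: int = 2,
    depth: int = 3,
) -> Path:
    """Return the best-scoring path found after at most `depth` expansions."""
    beam = [start]
    for _ in range(depth):
        candidates = [path + (step,) for path in beam for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]  # keep only the most promising paths
    return max(beam, key=score)


# Toy puzzle: starting from 1, apply operations to reach 10.
def value(path: Path, start_value: int = 1) -> int:
    x = start_value
    for step in path:
        x = x * int(step[1:]) if step[0] == "*" else x + int(step[1:])
    return x


best = beam_search(
    start=(),
    expand=lambda path: ["+1", "+2", "*2"],
    score=lambda path: -abs(10 - value(path)),
)
```

In search-based training, paths found this way would serve as training targets or as a source of reward, and the explicit sequence of steps is what makes the result easier to inspect.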

## Practical Significance and Application Prospects

The resource library gives AI researchers and engineers end-to-end support from theory to practice. Whether you want to understand the principles of RLHF or find a code implementation, you can find references here. On-Policy post-training is reshaping the capability boundaries of LLMs. Models such as ChatGPT, Claude, Llama, and DeepSeek make extensive use of post-training methods of this kind. Mastering them can help in building more capable, reliable, and human-aligned AI systems.

## Conclusion and Learning Recommendations

The field of On-Policy post-training is developing rapidly, and new algorithms appear constantly. The value of this resource library lies in its continuous updates, which help researchers keep up with the latest progress. Recommended learning path: build an overall understanding from survey articles, then read the original papers on specific methods, and finally verify your understanding by working with open-source code. Together these form a "survey, paper, code" learning loop.
