
On-Policy Post-Training Resource Library for Large Language Models: A Complete Technical Map from SFT to RLHF

This article introduces an open-source resource library that systematically organizes On-Policy post-training techniques for large language models, covering core methods such as online SFT, policy distillation, RLHF, RLVR, self-improvement, validator-guided learning, and search-based training, providing a one-stop learning path for researchers and engineers.

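Tags: large language models, On-Policy, post-training, RLHF, online SFT, policy distillation, reinforcement learning, reward model, self-improvement, validator-guided learning, open-source resources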

Published 2026-06-15 02:43 | Recent activity 2026-06-15 02:53 | Estimated read 8 min

Section 01

On-Policy Post-Training Resource Library for Large Language Models: A Complete Technical Map from SFT to RLHF (Introduction)

This article introduces an open-source resource library maintained by Masoud Jafaripour, which systematically organizes On-Policy post-training techniques for large language models. It covers core methods including online SFT, policy distillation, RLHF, RLVR, self-improvement, validator-guided learning, and search-based training, providing a one-stop learning path for researchers and engineers. The resource library is hosted on GitHub (link: https://github.com/Masoudjafaripour/Awesome-On-Policy-Post-Training-for-LLMs), uses the MIT license, was released on June 14, 2026, and is updated continuously to track the latest developments in the field.

Section 02

Background: Why On-Policy Post-Training Matters So Much

The development of large language models (LLMs) has shifted its emphasis from pre-training to post-training. Pre-training equips a model with general language capabilities, but post-training is what makes it practically useful and aligns it with human preferences. Traditional supervised fine-tuning (SFT) relies on static, manually annotated data, which makes it hard to capture the nuances of human preferences and gives the model no way to learn from its own interactions. On-Policy post-training methods instead train the model on outputs sampled from its current policy, using feedback from an environment, a verifier, a reward model, or humans, and often policy gradient optimization, so that its outputs move closer to the desired behavior.

Section 03

Overview of the Resource Library: One-Stop Technical Navigation

This open-source resource library systematically organizes core papers, open-source code, review articles, and benchmark tests in the field of On-Policy post-training, classified by technical methods, and provides a clear learning path. The resource library uses the MIT license, allowing free use, modification, and distribution to help the community progress together.

Section 04

Analysis of Core Technical Methods

1. Online Supervised Fine-Tuning (Online SFT)

Traditional SFT uses static datasets, while online SFT dynamically generates samples and filters them. Its advantage is exploring a wider output space and selecting high-quality samples through self-assessment or external feedback. The challenge is avoiding the cyclic reinforcement of error patterns.
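The generate-then-filter loop described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular paper in the library: `sample_fn`, `score_fn`, the toy arithmetic policy, and the threshold are all hypothetical stand-ins for a real model, a real scorer, and tuned settings.

```python
import random


def online_sft_round(sample_fn, score_fn, prompts, k=4, threshold=0.5):
    """Collect one round of on-policy SFT data.

    For each prompt, draw k samples from the current policy, score them,
    and keep the best one only if it clears the threshold.
    """
    dataset = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(k)]
        scored = [(score_fn(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            dataset.append((prompt, best))
    return dataset


def make_toy_policy(seed=0):
    """Hypothetical noisy 'policy' that answers 'a+b' prompts, sometimes wrongly."""
    rng = random.Random(seed)

    def sample(prompt):
        a, b = map(int, prompt.split("+"))
        return str(a + b + rng.choice([-1, 0, 0, 1]))

    return sample


def exact_match_score(prompt, completion):
    """Hypothetical external feedback: 1.0 if the sum is correct, else 0.0."""
    a, b = map(int, prompt.split("+"))
    return 1.0 if completion == str(a + b) else 0.0
```

In a real pipeline the kept pairs would be used for a fine-tuning step, after which the updated model generates the next round. A strict external scorer, as in this sketch, is one way to limit the self-reinforcement of errors mentioned above; a scorer that relies on the model's own judgment is more exposed to it.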

2. On-Policy Policy Distillation

The student model generates its own samples and learns from the teacher model's guidance on those samples, rather than imitating a fixed set of teacher outputs. It is reported to offer higher sample efficiency and better generalization than offline distillation.
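As a rough illustration of what "on-policy" means for distillation, the sketch below rolls out tokens from the student and measures the reverse KL divergence between student and teacher at each state the student visits. Reverse KL is one common choice of objective, not necessarily the one used by every method in the library, and `student`, `teacher`, and the three-token vocabulary are hypothetical.

```python
import math
import random


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def on_policy_distill_loss(student, teacher, prompt, steps, rng):
    """Average per-token reverse KL along a rollout sampled from the student.

    `student` and `teacher` map a token prefix to a next-token distribution.
    The prefix is extended with the student's own samples, so the loss is
    evaluated on states the student actually reaches.
    """
    prefix = list(prompt)
    total = 0.0
    for _ in range(steps):
        p = student(prefix)
        q = teacher(prefix)
        total += reverse_kl(p, q)
        token = rng.choices(range(len(p)), weights=p)[0]
        prefix.append(token)
    return total / steps


def toy_teacher(prefix):
    """Hypothetical teacher over a 3-token vocabulary."""
    return [0.5, 0.3, 0.2]


def toy_student(prefix):
    """Hypothetical student that has not yet matched the teacher."""
    return [0.7, 0.2, 0.1]
```

In actual training the loss would be differentiated with respect to the student's parameters; here it is only computed, to show where the samples come from.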

3. RLHF: Learning from Human Feedback

RLHF trains a reward model to capture human preferences and then optimizes the LLM's outputs against it, typically with PPO (Proximal Policy Optimization). This turns subjective judgments into an optimizable objective. However, it faces challenges such as over-optimization of the reward model, annotation bias, and high computational costs.
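The two stages above each have a compact core formula, sketched here on scalars. The first function is the pairwise (Bradley-Terry style) loss commonly used to fit a reward model on preference pairs; the second is PPO's clipped surrogate objective for a single action. Function and parameter names are illustrative, and practical RLHF systems add further pieces (batching, value functions, a KL penalty against a reference model) that are omitted here.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected))."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one action (to be maximized).

    The probability ratio is clipped to [1 - eps, 1 + eps] so that a single
    update cannot move the policy too far from the one that produced the sample.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The loss shrinks as the reward model scores the preferred response higher than the rejected one. The clipping in the second function is also one practical guard against the over-optimization problem noted above, since it caps how much credit a single update can take from a positive advantage.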

4. RLVR: Reinforcement Learning with Verifiable Rewards

RLVR uses automatically verifiable reward signals (e.g., code execution results, mathematical answers) and performs well on reasoning tasks. Models such as DeepSeek-R1 have demonstrated its potential to improve reasoning capabilities.
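A verifiable reward can be as simple as a function that checks the final answer. The sketch below assumes, purely for illustration, that completions end with a line of the form "Answer: <number>"; real systems use task-specific parsers or run unit tests. The second function shows one common way to turn a group of such rewards for the same prompt into advantages by normalizing within the group; the names and the `eps` value are hypothetical.

```python
import re
import statistics


def math_reward(completion, reference):
    """Return 1.0 if the completion's final 'Answer: <number>' matches the reference."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if float(match.group(1)) == float(reference) else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards of several samples for one prompt to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward is computed by a program rather than a learned model, it cannot be over-optimized in the same way as a reward model, although a policy can still exploit weaknesses in the checker itself.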

5. Self-Improvement and Validator-Guided Learning

Self-improvement lets a model refine itself iteratively without external supervision; validator-guided learning introduces an independent verification component to catch and correct errors. Combining the two is particularly effective on complex reasoning tasks.
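The difference between the two ideas can be shown with a small selection routine. Without a validator, the sketch falls back on majority voting over the model's own answers as a pseudo-label, which is one simple form of self-improvement signal; with a validator, an independent check decides which samples to keep. The function names and the `(reasoning, answer)` sample format are assumptions made for this example.

```python
from collections import Counter


def majority_vote(answers):
    """Most frequent answer among the model's own samples."""
    return Counter(answers).most_common(1)[0][0]


def select_training_targets(samples, verifier=None):
    """Pick which (reasoning, answer) samples to train on.

    With no verifier, keep samples that agree with the majority answer.
    With a verifier, keep only the samples whose answer it accepts.
    """
    if verifier is not None:
        return [s for s in samples if verifier(s[1])]
    label = majority_vote([answer for _, answer in samples])
    return [s for s in samples if s[1] == label]


# Toy samples for the prompt "6 + 7": the majority answer is wrong.
toy_samples = [
    ("6 + 7 = 12", "12"),
    ("6 + 6 = 12", "12"),
    ("6 + 7 = 13", "13"),
]
```

On `toy_samples`, majority voting keeps the two incorrect "12" samples, while a verifier that checks the sum keeps only the correct one. This is the failure mode an independent validator is meant to correct.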

6. Search-Based Training Methods

These methods combine training with search algorithms (e.g., MCTS, beam search) to explore better reasoning paths, with the aim of improving accuracy and interpretability.
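To make the search component concrete, here is a minimal beam search over a toy problem: reach a target number from 1 using "+1" and "*2" steps. In search-based training, the states would be partial reasoning traces and the score would come from a value or reward model, and the paths found would then serve as training data. Everything named here (`expand`, `score`, `is_final`, the toy task) is a hypothetical stand-in.

```python
def beam_search(start, expand, score, is_final, beam_width=2, max_depth=5):
    """Keep the beam_width best partial paths at each depth; return the best result."""
    beam = [start]
    finished = []
    for _ in range(max_depth):
        candidates = []
        for state in beam:
            for nxt in expand(state):
                if is_final(nxt):
                    finished.append(nxt)
                else:
                    candidates.append(nxt)
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    pool = finished or beam  # fall back to the best partial path
    return max(pool, key=score)


TARGET = 10


def expand(state):
    """A state is (value, ops); each step applies '+1' or '*2'."""
    value, ops = state
    return [(value + 1, ops + ("+1",)), (value * 2, ops + ("*2",))]


def score(state):
    """Higher is better: negative distance to the target."""
    return -abs(TARGET - state[0])


def is_final(state):
    return state[0] == TARGET
```

Calling `beam_search((1, ()), expand, score, is_final)` returns a state whose value is 10 together with the sequence of steps that produced it. The recorded step sequence is what makes the result inspectable, which is the interpretability benefit mentioned above.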

Section 05

Practical Significance and Application Prospects

The resource library gives AI researchers and engineers end-to-end support, from theory to practice. Whether you want to understand the principles of RLHF or find code implementations, you can find references here. On-Policy post-training is reshaping the capability boundaries of LLMs, and leading model families such as ChatGPT, Claude, Llama, and DeepSeek rely on post-training methods of this kind. Mastering these methods can help build more intelligent, reliable, and human-aligned AI systems.

Section 06

Conclusion and Learning Recommendations

The field of On-Policy post-training is developing rapidly, and new algorithms appear constantly. The value of this resource library lies in its continuous updates, which help researchers keep up with the latest progress. Recommended learning path: build an overall understanding from review articles, then read the original papers on specific methods, and finally verify your understanding by working with the open-source code, following a three-step "review, paper, code" progression.
