1. Online Supervised Fine-Tuning (Online SFT)
Traditional SFT trains on a static dataset, whereas online SFT generates samples from the current model during training and filters them before they are used. Its advantage is that it explores a wider output space and selects high-quality samples through self-assessment or external feedback. The challenge is avoiding a feedback loop in which the model's own error patterns are reinforced from one round to the next.
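The generate-then-filter loop described above can be sketched as follows. This is a minimal illustration, not a real training pipeline: `generate_candidates` and `score` are hypothetical stand-ins for sampling from the current model and for a quality signal (self-assessment or external feedback), and the threshold value is an arbitrary assumption.

```python
import random

def generate_candidates(prompt, n, rng):
    # Hypothetical stand-in for sampling n completions from the current model.
    return [f"{prompt} -> draft {rng.randint(0, 9)}" for _ in range(n)]

def score(sample):
    # Hypothetical quality signal in [0, 1]; here it just reads the trailing digit.
    return int(sample[-1]) / 9

def online_sft_round(prompts, n_samples=4, threshold=0.5, seed=0):
    """Sample fresh candidates and keep only those that pass the filter."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        for candidate in generate_candidates(prompt, n_samples, rng):
            if score(candidate) >= threshold:
                kept.append((prompt, candidate))
    # In a real system, `kept` would become the SFT data for the next update.
    return kept

if __name__ == "__main__":
    for prompt, candidate in online_sft_round(["Explain recursion"]):
        print(prompt, "|", candidate)
```

If the filter is too lenient, or is produced by the same model that generated the samples, low-quality outputs can pass through and be trained on again, which is the error-reinforcement risk noted above.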
2. On-Policy Distillation
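The student model samples its own outputs, and the teacher model provides the learning signal on those samples, so the student is trained on the states it actually visits. This typically gives higher sample efficiency and better generalization than offline distillation, which trains on a fixed, pre-collected dataset.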
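A minimal sketch of the idea, assuming the teacher's guidance takes the form of a per-token reverse KL divergence (a common choice, but not one the text above specifies). `student_policy` and `teacher_policy` are hypothetical callables that map a token prefix to a next-token probability list, and the teacher is assumed to give every token nonzero probability.

```python
import math
import random

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )

def on_policy_distill_loss(student_policy, teacher_policy, prompt, steps, rng):
    """Roll out the student and score each visited prefix against the teacher."""
    prefix, total = list(prompt), 0.0
    for _ in range(steps):
        s_probs = student_policy(prefix)
        t_probs = teacher_policy(prefix)
        total += reverse_kl(s_probs, t_probs)
        # On-policy: the next token is drawn from the student, not the teacher.
        token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        prefix.append(token)
    return total / steps

if __name__ == "__main__":
    student = lambda prefix: [0.5, 0.5]   # toy two-token vocabulary
    teacher = lambda prefix: [0.9, 0.1]
    print(on_policy_distill_loss(student, teacher, [0], 4, random.Random(0)))
```

The point of the sketch is where the prefixes come from: they are sampled from the student, so the loss is measured on the student's own distribution of contexts rather than on teacher-written sequences.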
3. RLHF: Reinforcement Learning from Human Feedback
RLHF trains a reward model to capture human preferences and then uses PPO (Proximal Policy Optimization) to optimize the LLM's outputs against that reward model, turning subjective judgments into an optimizable objective. Its main challenges are over-optimization against the reward model, annotation bias, and high computational cost.
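The two stages above can be illustrated with their scalar loss terms. This is a sketch under assumptions: the pairwise (Bradley-Terry style) reward-model loss is a common formulation that the text does not name explicitly, and the clip range `eps=0.2` is an illustrative default. A real implementation would operate on batched tensors and usually adds further terms, such as a KL penalty toward a reference model.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected))."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized).

    ratio = pi_new(action | state) / pi_old(action | state)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

if __name__ == "__main__":
    # The reward model is penalized more when it ranks the pair the wrong way.
    print(preference_loss(2.0, 0.0), preference_loss(0.0, 2.0))
    # A large policy change earns no extra credit beyond the clip range.
    print(ppo_clipped_objective(1.5, advantage=1.0))
```

The clipping limits how far a single update can move the policy, but it does not by itself prevent the policy from exploiting flaws in the reward model, which is the over-optimization problem mentioned above.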
4. RLVR: Reinforcement Learning with Verifiable Rewards
RLVR uses reward signals that can be verified automatically (e.g., code execution results, final answers to math problems), and it performs well on reasoning tasks. Models such as DeepSeek-R1 have demonstrated its potential to improve reasoning capabilities.
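As a small illustration of a verifiable reward for math answers, the sketch below gives a reward of 1.0 when the last number in a completion matches the reference answer and 0.0 otherwise. The extraction rule (take the last number in the text) and the binary reward are simplifying assumptions; real checkers are usually stricter about answer format.

```python
import re

def extract_final_answer(completion):
    """Return the last number that appears in the completion, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0

if __name__ == "__main__":
    print(verifiable_reward("2 + 2 = 4, so the answer is 4.", "4"))
    print(verifiable_reward("I think the answer is 5.", "4"))
```

Because the reward comes from a deterministic check rather than a learned model, it requires no human preference labels and cannot drift the way a learned reward model can, although it only applies to tasks whose outputs can be checked this way.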
5. Self-Improvement and Validator-Guided Learning
Self-improvement lets a model iteratively refine its own outputs without external supervision; validator-guided learning adds an independent verification component that catches and corrects errors. Combining the two is particularly effective on complex reasoning tasks.
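The combination can be sketched as a propose-validate-revise loop. Everything here is a toy assumption: `propose` stands in for the model, `validate` for the independent validator, and the task (finding an integer whose square is a target) is chosen only so that the example runs on its own.

```python
def self_improve(propose, validate, task, max_rounds=3):
    """Propose, check with an independent validator, and revise on feedback."""
    feedback = None
    for round_idx in range(max_rounds):
        candidate = propose(task, feedback)
        ok, feedback = validate(task, candidate)
        if ok:
            return candidate, round_idx + 1
    return None, max_rounds

def toy_propose(task, feedback):
    # Hypothetical "model": starts from a fixed guess, then follows the hint.
    if feedback is None:
        return 5
    last_guess, hint = feedback
    return last_guess + 1 if hint == "too low" else last_guess - 1

def toy_validate(task, candidate):
    # Hypothetical validator: checks the answer and returns a corrective hint.
    if candidate * candidate == task:
        return True, None
    hint = "too low" if candidate * candidate < task else "too high"
    return False, (candidate, hint)

if __name__ == "__main__":
    print(self_improve(toy_propose, toy_validate, task=49, max_rounds=5))
```

The validator is kept separate from the proposer on purpose: if the same component both produced and approved an answer, its mistakes would go unchecked.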
6. Search-Based Training Methods
These methods combine training with search algorithms (e.g., Monte Carlo Tree Search (MCTS), beam search) to explore better reasoning paths, improving accuracy and interpretability.
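To show why search can find better paths than greedy decoding, here is a minimal beam search over a toy transition table. The table `TOY_TRANSITIONS`, its probabilities, and the `step_fn` interface are all made up for illustration; in practice the step function would be a model proposing next reasoning steps or tokens.

```python
import math

# Hypothetical next-step probabilities for a tiny two-level search space.
TOY_TRANSITIONS = {
    "S": [("A", 0.6), ("B", 0.4)],
    "A": [("x", 0.5), ("y", 0.5)],
    "B": [("z", 0.9), ("w", 0.1)],
}

def toy_step(path):
    return TOY_TRANSITIONS.get(path[-1], [])

def beam_search(step_fn, start, beam_width, max_steps):
    """Keep the beam_width highest log-probability paths at every step."""
    beams = [(0.0, [start])]
    for _ in range(max_steps):
        candidates = []
        for logp, path in beams:
            for token, prob in step_fn(path):
                candidates.append((logp + math.log(prob), path + [token]))
        if not candidates:
            break  # no path can be extended any further
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

if __name__ == "__main__":
    print(beam_search(toy_step, "S", beam_width=1, max_steps=2)[0][1])
    print(beam_search(toy_step, "S", beam_width=2, max_steps=2)[0][1])
```

With a beam width of 1 the search commits to the locally best first step (A) and ends on a path with probability 0.3, while a width of 2 keeps B alive and finds the path through B and z with probability 0.36.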