Section 01
[Introduction] SePT: Core Analysis of a Reward-Model-Free Self-Training Reasoning Framework for LLMs
SePT (Self-Training with Process Rewards) is a reward-model-free self-training method that lets large language models (LLMs) continuously improve their reasoning through self-generated process reward signals, opening a new path to lower RLHF training costs. Its core idea is "process as reward": the model generates candidate reasoning paths, self-evaluates the quality of each path's intermediate steps, and bootstraps from the highest-scoring paths to learn effective strategies. This design sidesteps two bottlenecks of traditional RLHF, namely the reliance on expensive annotated preference data and the poor generalization of learned reward models. The method has shown strong experimental performance and significant application value.
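The generate-evaluate-bootstrap loop described above can be sketched in a few lines. This is a minimal toy illustration, not SePT's actual implementation: `process_score` stands in for the model's own step-wise self-evaluation (here replaced by a simple heuristic), and `self_training_round` is a hypothetical helper name showing how the best-scoring path would be kept as a self-training example.

```python
def process_score(path):
    """Hypothetical process reward: average a per-step score over the path.

    In SePT this would come from the model's own self-evaluation of each
    reasoning step; here a toy heuristic (steps near a target length score
    higher) stands in so the loop structure is runnable.
    """
    step_scores = [1.0 / (1 + abs(len(step) - 6)) for step in path]
    return sum(step_scores) / len(step_scores)


def self_training_round(question, candidates, buffer):
    """One sketch round: score every candidate reasoning path with the
    process reward, keep the best path as a (question, path) training
    example in the self-training buffer, and return it."""
    best = max(candidates, key=process_score)
    buffer.append((question, best))
    return best


# Toy usage: two candidate reasoning paths for one question.
buffer = []
candidates = [
    ["step 1", "step 2"],      # well-formed steps -> higher toy score
    ["a", "b"],                # degenerate steps  -> lower toy score
]
best = self_training_round("What is 2+2?", candidates, buffer)
```

The buffer of selected (question, path) pairs would then drive a supervised fine-tuning step, after which the loop repeats with the updated model.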