Section 01
[Introduction] Self-ReSET: Enabling Large Language Models to Self-Recover from Dangerous Reasoning Trajectories
Self-ReSET is a pure reinforcement learning framework whose core innovation is enabling models to learn recovery capabilities from their own safety error trajectories, significantly enhancing robustness against adversarial attacks (especially out-of-distribution jailbreak prompts) while preserving general capabilities. This framework uses dynamic on-policy reasoning trajectory training to eliminate the gap between static data and dynamic behavior, providing a new direction for the safety alignment of reasoning models.