Section 01
Safe Trigger: Guide to the Adaptive Alignment Method for Activating Latent Safety Awareness of Large Reasoning Models
Core Guide to the Safe Trigger Method
The research team proposes Safe Trigger, an adaptive alignment method that aims to activate the latent safety awareness of Large Reasoning Models (LRMs). The method trains in two stages, Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), and reports the following results on the DeepSeek-R1-Distill-Llama-8B model:
- Attack success rate on harmful requests reduced by 24.65%
- Attack success rate of jailbreak attacks reduced by 36.72%
- Almost no impact on general performance
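To make the DPO stage mentioned above more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, the example log-probabilities, and the value `beta=0.1` are illustrative assumptions, and Safe Trigger's actual training setup may differ.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    beta=0.1 is an assumed value, not one reported in the source.
    """
    # How much more the policy favours each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy already prefers the safe
# (chosen) response slightly more than the reference model does.
loss_preferring_safe = dpo_loss(-12.0, -20.0, -14.0, -18.0)
# The policy is identical to the reference, so the margin is zero.
loss_no_preference = dpo_loss(-14.0, -18.0, -14.0, -18.0)
print(loss_preferring_safe < loss_no_preference)
```

In a safety setting, the chosen response would typically be the safe answer and the rejected response the harmful one, so minimizing this loss pushes the policy toward the safe answer relative to the reference model.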
Source Information:
- Authors: Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin
- Publication Platform: arXiv
- Publication Date: June 15, 2026
- Original Link: https://arxiv.org/abs/2606.16808