Zing Forum

Safe Trigger: An Adaptive Alignment Method to Activate the Latent Safety Awareness of Large Reasoning Models

Researchers found that large reasoning models have latent safety awareness: they can identify security risks through self-reflection. The Safe Trigger method, trained via SFT and DPO, reduces the attack success rate of harmful queries by 24.65% and of jailbreak attacks by 36.72% on DeepSeek-R1-Distill-Llama-8B, with almost no impact on general performance.

Tags: large reasoning models, safety alignment, jailbreak attacks, supervised fine-tuning, direct preference optimization, LRM
Published 2026-06-15 22:51 | Recent activity 2026-06-16 12:22 | Estimated read 8 min

Section 01

Safe Trigger: Guide to the Adaptive Alignment Method for Activating Latent Safety Awareness of Large Reasoning Models

Core Guide to the Safe Trigger Method

The research team proposes the Safe Trigger adaptive alignment method, which aims to activate the latent safety awareness of Large Reasoning Models (LRMs). Through two-stage training of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), this method achieves the following results on the DeepSeek-R1-Distill-Llama-8B model:

  • Reduced harmful attack success rate by 24.65%
  • Reduced jailbreak attack success rate by 36.72%
  • Almost no impact on general performance

Source Information:

  • Authors: Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin
  • Publication Platform: arXiv
  • Publication Date: June 15, 2026
  • Original Link: https://arxiv.org/abs/2606.16808

Section 02

Research Background: The Security Dilemma of Large Reasoning Models

Security Challenges of Large Reasoning Models

LRMs (such as DeepSeek-R1, OpenAI o-series) excel in complex tasks with explicit Chain-of-Thought reasoning, but they also bring new security issues:

  1. Escalated Jailbreak Attacks: Attackers use reasoning capabilities to bypass security mechanisms through complex prompts like multi-turn dialogues and role-playing.
  2. Limitations of Existing Alignment Methods:
    • High cost of manual annotation: High-quality security datasets require a large number of professionals;
    • Limited coverage: It is difficult for humans to exhaust all attack variants;
    • Trade-off between performance and security: Over-alignment impairs general capabilities and user experience.

Section 03

Core Finding: Latent Safety Awareness of Large Models

Discovery of Latent Safety Awareness

The research team observed that when the original query is presented together with the model's own reasoning trajectory, the model can identify security risks. This ability is called latent safety awareness.

Key Insight: While generating a reasoning chain, the LRM already "realizes" that the request is potentially problematic but does not turn that realization into a safe response. Triggering this awareness can achieve security alignment without external annotation.
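
The probing setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_reflection_prompt` and `parse_verdict`, the template wording, and the SAFE/UNSAFE reply format are all assumptions, since this summary does not give the paper's exact prompt.

```python
def build_reflection_prompt(query: str, reasoning_trace: str) -> str:
    """Pair the original query with the model's own reasoning trace.

    Hypothetical template; the paper's actual wording is not given here.
    """
    return (
        "Below is a user query and the reasoning you produced for it.\n\n"
        f"Query:\n{query.strip()}\n\n"
        f"Your reasoning:\n{reasoning_trace.strip()}\n\n"
        "Does answering this query pose a safety risk? "
        "Reply with SAFE or UNSAFE and one sentence of justification."
    )


def parse_verdict(reply: str) -> bool:
    """Return True if the model's reply flags the query as unsafe."""
    return reply.strip().upper().startswith("UNSAFE")
```

The prompt string would be sent back to the same model, and `parse_verdict` turns its reply into a boolean flag that later steps can use in place of a human label.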


Section 04

Safe Trigger Method: Two-Stage Training Mechanism

Detailed Explanation of the Safe Trigger Method

Based on latent safety awareness, this method activates safe responses through two-stage training:

Stage 1: SFT-Induced Safety Labels

  • Adaptive Trigger: Normal queries keep their standard responses; for unsafe queries, a safety label is inserted first and the security analysis follows;
  • Bootstrapped Training Data: Use the model's own generated reasoning chains to filter positive/negative examples, eliminating dependence on manual annotation;
  • Explicit Label Design: Safety labels act as a "switch" to toggle between normal reasoning and security analysis modes.
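
A minimal sketch of how Stage 1 training targets might be assembled from the model's own outputs. The label string `<safety_check>`, the `Sample` fields, and both function names are hypothetical; the summary does not state the actual label token or data schema.

```python
from dataclasses import dataclass

SAFETY_LABEL = "<safety_check>"  # hypothetical label; the real token is not given here


@dataclass
class Sample:
    query: str
    reasoning: str        # reasoning chain (or security analysis) generated by the model itself
    response: str         # final answer or refusal generated by the model itself
    flagged_unsafe: bool  # the model's own verdict, not a human annotation


def build_sft_target(sample: Sample) -> str:
    """Adaptive trigger: prepend the safety label only for unsafe queries."""
    body = f"{sample.reasoning}\n{sample.response}"
    if sample.flagged_unsafe:
        return f"{SAFETY_LABEL}\n{body}"
    return body


def build_sft_dataset(samples):
    """Map self-generated samples to (prompt, target) pairs for fine-tuning."""
    return [(s.query, build_sft_target(s)) for s in samples]
```

Because benign targets never contain the label, the fine-tuned model has a single visible marker that separates normal reasoning from security analysis.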

Stage 2: DPO-Optimized Security Analysis

  • Preference Pair Construction: Generate paired samples of correct rejection (positive example) and incorrect response (negative example) for unsafe queries;
  • Stability Enhancement: Improve the accuracy of security analysis and enhance robustness against prompt variants.
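
As a rough illustration of Stage 2, the sketch below builds one preference pair and evaluates the standard DPO loss for it from sequence log-probabilities. The dictionary keys, the function names, and the default `beta=0.1` are assumptions made for this example; the summary does not report the paper's hyperparameters or data format.

```python
import math


def make_preference_pair(query: str, refusal: str, harmful_answer: str) -> dict:
    """For an unsafe query: the correct refusal is preferred over the harmful answer."""
    return {"prompt": query, "chosen": refusal, "rejected": harmful_answer}


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single pair.

    loss = -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))
    where each term is the summed log-probability of a full response under
    the policy being trained (logp_*) or the frozen reference model (ref_*).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branching avoids overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference model the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the refusal than the reference does.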

Section 05

Experimental Results: Security Improvement and General Performance Preservation

Experimental Validation Results

Tests on DeepSeek-R1-Distill-Llama-8B show:

  1. Security Performance Improvement:
    • Harmful query attack success rate decreased by 24.65%;
    • Jailbreak attack success rate decreased by 36.72%.
  2. Almost No Loss of General Performance: Standard capability benchmarks, user experience, and response quality on normal reasoning tasks all stay close to their original levels.
  3. Cross-Model Transfer: The method can achieve similar security improvements across different LRM architectures.
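
For readers unfamiliar with the metric, here is a small sketch of how an attack success rate (ASR) and its reduction are typically computed. The function names and the sample outcomes are made up for illustration and are not the paper's data. The summary also does not say whether the reported 24.65% and 36.72% are absolute percentage-point drops or relative reductions; this sketch computes the absolute drop.

```python
def attack_success_rate(outcomes):
    """Share of attack prompts that elicited a harmful response (True = attack succeeded)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(outcomes) / len(outcomes)


def asr_reduction_points(before, after):
    """Absolute drop in ASR, expressed in percentage points."""
    return 100.0 * (attack_success_rate(before) - attack_success_rate(after))


# Made-up outcomes for illustration only.
before = [True, True, True, False]    # 3 of 4 attacks succeed on the base model
after = [True, False, False, False]   # 1 of 4 attacks succeed after alignment
print(asr_reduction_points(before, after))  # 50.0
```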

Section 06

Technical Contributions: Bootstrapped Alignment and Explicit Triggering

Technical Contributions and Methodological Significance

Core contributions of Safe Trigger:

  1. Bootstrapped Alignment Paradigm: The model aligns itself on self-generated data, which reduces dependence on manual annotation and can scale to other alignment tasks;
  2. Explicit Safety Triggering: Safety labels enable controllable and interpretable safe behavior, facilitating audit and debugging;
  3. Minimal Intervention Principle: The security mechanism is only triggered when risks are detected, avoiding interference with normal dialogues.
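
One way an explicit label could support auditing is sketched below: scan logged responses for the label and measure how often the safety mode fired. The label string `<safety_check>` and both function names are hypothetical, since the summary does not specify the actual label format.

```python
SAFETY_LABEL = "<safety_check>"  # hypothetical label; the real token is not given here


def was_triggered(response: str) -> bool:
    """True if the response opens with the safety label."""
    return response.lstrip().startswith(SAFETY_LABEL)


def trigger_rate(responses) -> float:
    """Fraction of responses in which the safety mode fired."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(was_triggered(r) for r in responses) / len(responses)
```

Run over a set of benign queries, a high `trigger_rate` would point to over-triggering; run over known attack prompts, a low rate would point to missed risks.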

Section 07

Limitations and Future Research Directions

Limitations and Future Directions

The current method has the following limitations and improvement directions:

  1. Attack Adaptability: Need to address adversarial attacks targeting Safe Trigger;
  2. Multilingual Safety: Need to verify alignment effects in languages other than English;
  3. Generalization of Safety Labels: Explore more general triggering mechanisms to improve the method's universality.

Section 08

Conclusion: The Value and Significance of Safe Trigger

Research Conclusion

Safe Trigger achieves efficient adaptive security alignment by activating the latent safety awareness of LRMs:

  • Eliminates dependence on manual annotation;
  • Significantly improves security (reduces attack success rates);
  • Preserves general performance.

This research provides new ideas for LRM security alignment and lays the foundation for bootstrapped model alignment methods.