Zing Forum

How Does Chain-of-Thought Protect AI's Safe Refusal Mechanism? New Findings on Large Reasoning Models

The study found that the refusal mechanism of large reasoning models relies not only on a single direction in the activation space but also heavily on the Chain-of-Thought (CoT). This joint encoding makes the model more robust to activation manipulation, but it also exposes the CoT as a potential attack surface.

Tags: Large Reasoning Models, Chain-of-Thought, Activation Manipulation, AI Safety, Refusal Mechanism, DeepSeek, Model Alignment
Published 2026-05-26 17:41 | Recent activity 2026-05-27 12:54 | Estimated read 5 min

Section 01

How Does Chain-of-Thought Affect the Safe Refusal Mechanism of Large Reasoning Models? A Quick Overview of the Core Findings

Research Source

  • Original Authors: arXiv authors
  • Source Platform: arXiv
  • Original Title: Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
  • Link: http://arxiv.org/abs/2605.26772v1
  • Publication Date: 2026-05-26

Core Points

The study shows that the refusal mechanism of Large Reasoning Models (LRMs) is jointly encoded in the activation space and the Chain-of-Thought (CoT). This makes the model more robust to activation manipulation, but it also exposes the CoT as a potential attack surface.


Section 02

Research Background: Core Challenges in AI Safety and Refusal Mechanisms

As large language models become more capable, safety and controllability have become core concerns. In traditional instruction-tuned models, the refusal mechanism relies on a single direction in the activation space and is easily altered by activation manipulation. LRMs (such as the DeepSeek-R1 series), however, generate a CoT reasoning process before producing their final output, which raises two key questions: does the refusal mechanism still rely only on the activation space, and what role does the CoT play?
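
For readers new to the "single direction" idea, the following is a minimal sketch in plain Python. It assumes the direction is estimated as the difference between mean activations on harmful and harmless prompts and is then projected out of a hidden state. The source does not say how the paper computes or applies its direction, so the recipe, the function names, and the toy vectors are all illustrative assumptions.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumed recipe (difference of means); not confirmed by the source.
    Raises ZeroDivisionError if the two means coincide.
    """
    harmful_mean = mean_vector(harmful_acts)
    harmless_mean = mean_vector(harmless_acts)
    diff = [h - b for h, b in zip(harmful_mean, harmless_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy 3-dimensional activations, invented purely for illustration.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(harmful, harmless)
steered = ablate([3.0, 5.0, 1.0], direction)
```

In this toy case the two groups differ only along the first coordinate, so ablation zeroes that coordinate and leaves the rest of the vector untouched. In a real model the same projection would be applied to residual-stream activations through forward hooks.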


Section 03

Experimental Design: Three-Stage Intervention Strategy

The study used DeepSeek-R1-Distill-Llama-8B as the experimental subject and designed three key experiments:

  1. Activation manipulation with fixed CoT: Keep the CoT intact and manipulate only the activations for the final output
  2. Activation manipulation after removing CoT: Remove the CoT and manipulate the input activations directly
  3. CoT regeneration under manipulation: Apply activation manipulation first, then let the model generate a new CoT
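
One way to picture how the three conditions differ is by what the model sees in its context before it writes the final answer. The sketch below is a simplified assumption rather than the paper's setup: it uses the `<think>` and `</think>` delimiters that DeepSeek-R1 models emit around their reasoning, skips the real chat template, and does not show the steering step itself. The `build_context` helper and the condition names are hypothetical.

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def build_context(prompt, condition, original_cot=""):
    """Return the text the model would continue from (simplified, no chat template)."""
    if condition == "fixed_cot":
        # Original reasoning is kept; steering would apply only while answering.
        return f"{prompt}\n{THINK_OPEN}\n{original_cot}\n{THINK_CLOSE}\n"
    if condition == "removed_cot":
        # Reasoning block is emptied; steering would act without any CoT support.
        return f"{prompt}\n{THINK_OPEN}\n\n{THINK_CLOSE}\n"
    if condition == "regenerated_cot":
        # Reasoning block is left open; the model writes a new CoT while steered.
        return f"{prompt}\n{THINK_OPEN}\n"
    raise ValueError(f"unknown condition: {condition}")

prompt = "User request goes here."
cot = "The request looks unsafe, so I should decline."

contexts = {
    name: build_context(prompt, name, cot)
    for name in ("fixed_cot", "removed_cot", "regenerated_cot")
}
```

Only the first context contains the original refusal-oriented reasoning, which is the variable the three experiments isolate.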

Section 04

Experimental Evidence: Impact of CoT on Refusal Reversal Rate

The experimental results show:

  1. When the CoT is fixed, activation manipulation reverses the refusal in only 39% of cases
  2. After the CoT is removed, the reversal rate jumps to 70%
  3. When the CoT is generated under manipulation, the reversal rate reaches 94%; keeping only this CoT still yields a 48% reversal rate
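
To make the comparison concrete, the short snippet below takes the four percentages reported above and computes each condition's gap relative to the fixed-CoT baseline. The dictionary keys are labels invented here; only the numbers come from the summary.

```python
# Reversal rates (%) as reported in the results above.
reversal_rate = {
    "fixed_cot_steered": 39,
    "cot_removed_steered": 70,
    "cot_regenerated_steered": 94,
    "regenerated_cot_only": 48,
}

baseline = reversal_rate["fixed_cot_steered"]

# Gap of every other condition over the fixed-CoT baseline, in percentage points.
gains = {
    name: rate - baseline
    for name, rate in reversal_rate.items()
    if name != "fixed_cot_steered"
}

for name, gain in gains.items():
    print(f"{name}: {gain:+d} percentage points vs. fixed CoT")
```

Removing the CoT adds 31 points and regenerating it under manipulation adds 55. Even the regenerated CoT kept on its own sits 9 points above steering against an intact CoT.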

Section 05

Core Conclusion: Joint Encoding Mechanism

The refusal mechanism of LRMs is jointly encoded in the residual-stream activations and the CoT:

  • Dual dependency: Refusal decisions depend on both the activation-space direction and the CoT reasoning process
  • CoT reinforcement: The CoT actively consolidates refusal signals and helps the model resist activation manipulation
  • Signal transfer: A CoT generated under manipulation can carry compliance signals on its own

Section 06

Safety Implications: Double-Edged Sword Effect

  • Positive: Joint encoding enhances the model's robustness to simple activation interventions
  • Negative: The CoT becomes a new attack surface (its visible text is easy to manipulate, the effects persist, and existing defenses are insufficient)

Section 07

Implications for AI Safety Research and Future Directions

Implications

  1. Re-evaluate safety testing: Evaluations need to account for the CoT's effect on the refusal mechanism
  2. Multi-layer defense: Combine activation monitoring, CoT content analysis, and output review
  3. Emphasize interpretability: Use the readability of the CoT to detect manipulation
  4. Training optimization: Encode more robust safety signals in the CoT
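
The multi-layer defense in item 2 could be wired together roughly as follows. This is a hypothetical sketch, not something the source describes: the three checks are toy heuristics, the `sample` fields and the 0.5 threshold are invented, and the sample is assumed to come from a prompt already known to be harmful (so a non-refusing output is itself suspicious).

```python
def low_refusal_projection(sample):
    """Activation layer: projection onto a refusal direction is unusually low."""
    return sample["refusal_projection"] < 0.5  # threshold is an arbitrary placeholder

def cot_signals_compliance(sample):
    """CoT layer: the reasoning text states an intent to comply (toy keyword check)."""
    return "i will comply" in sample["cot"].lower()

def output_not_refusal(sample):
    """Output layer: the final answer does not open with a refusal (toy check)."""
    return not sample["output"].lower().startswith("sorry")

LAYERS = {
    "activation_monitoring": low_refusal_projection,
    "cot_content_analysis": cot_signals_compliance,
    "output_review": output_not_refusal,
}

def run_layers(sample, layers=LAYERS):
    """Return the names of every layer that flags the sample."""
    return [name for name, check in layers.items() if check(sample)]

suspicious = {
    "refusal_projection": 0.1,
    "cot": "The request is fine. I will comply fully.",
    "output": "Sure, here is how to do it.",
}
clean = {
    "refusal_projection": 0.9,
    "cot": "This is unsafe, so I should decline.",
    "output": "Sorry, I can't help with that.",
}

suspicious_flags = run_layers(suspicious)
clean_flags = run_layers(clean)
```

Keeping the layers independent means that a manipulated CoT can still be caught by the CoT check even when the activation and output checks see nothing unusual, which is the scenario the signal-transfer finding warns about.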

Limitations and Future Work

  • Limitations: The findings were verified only on DeepSeek-R1 models and do not cover other LRMs or other safety scenarios
  • Future work: Cross-model verification, designing CoT-robust defenses, and exploring the CoT's role in other behaviors