# How Does Chain-of-Thought Protect AI's Safe Refusal Mechanism? New Findings on Large Reasoning Models

> The study finds that refusal in large reasoning models depends not only on a single direction in activation space but also on the Chain-of-Thought (CoT). This joint encoding makes the models more robust to activation manipulation, but it also exposes the CoT as a potential attack surface.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-26T09:41:15.000Z
- Last activity: 2026-05-27T04:54:27.113Z
- Popularity: 129.8
- Keywords: large reasoning models, chain-of-thought, activation manipulation, AI safety, refusal mechanism, DeepSeek, model alignment
- Page link: https://www.zingnex.cn/en/forum/thread/ai-2dfac7fd
- Canonical: https://www.zingnex.cn/forum/thread/ai-2dfac7fd

---

## How Does Chain-of-Thought Affect the Safe Refusal Mechanism of Large Reasoning Models? Core Findings at a Glance

### Research Source
- Original Authors: arXiv authors
- Source Platform: arXiv
- Original Title: Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
- Link: http://arxiv.org/abs/2605.26772v1
- Publication Date: 2026-05-26

### Core Points
The study reveals that the refusal mechanism of Large Reasoning Models (LRMs) is jointly encoded in the **activation space** and the **Chain-of-Thought (CoT)**. This makes the model more robust to activation manipulation, but it also exposes the CoT as a potential attack surface.

## Research Background: Core Challenges in AI Safety and Refusal Mechanisms

As large language models become more capable, safety and controllability have become core concerns. In traditional instruction-tuned models, refusal has been reported to depend on a single direction in the activation space, so activation manipulation along that direction can easily change the model's behavior. LRMs such as the DeepSeek-R1 series, however, generate a CoT reasoning process before producing their final answer. This raises two key questions: does the refusal mechanism still depend only on the activation space, and what role does the CoT play?
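
To make the "single direction" idea concrete, here is a minimal sketch using a difference-of-means direction, a common way to estimate such a direction. It is an illustrative assumption, not the procedure confirmed by the paper summary above. Plain Python lists stand in for residual-stream activations, the data is synthetic, and every function name is hypothetical.

```python
import math
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def project(act, direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * d for a, d in zip(act, direction))

def ablate(act, direction):
    """Remove the component along `direction` (one simple form of activation manipulation)."""
    p = project(act, direction)
    return [a - p * d for a, d in zip(act, direction)]

# Synthetic stand-in data: "harmful" activations are shifted along dimension 0.
random.seed(0)
D_MODEL = 8

def sample(shift):
    v = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    v[0] += shift
    return v

harmless = [sample(0.0) for _ in range(200)]
harmful = [sample(4.0) for _ in range(200)]

r_hat = refusal_direction(harmful, harmless)
steered = [ablate(a, r_hat) for a in harmful]
```

After ablation, every steered activation has a projection of approximately zero onto `r_hat`. In a model whose refusal really depended on that one direction alone, this kind of edit would be enough to flip the behavior, which is the assumption the study tests for LRMs.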

## Experimental Design: Three-Stage Intervention Strategy

The study used DeepSeek-R1-Distill-LLaMA-8B as the test model and designed three key experiments:
1. **Activation manipulation with fixed CoT**: Keep the original CoT intact and manipulate only the final output activation
2. **Activation manipulation after removing CoT**: Remove the CoT and manipulate the input activation directly
3. **CoT regeneration under manipulation**: Apply activation manipulation first, then let the model generate a new CoT
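
The three conditions can be sketched as different prefills placed before generation continues. This is one possible reading of the setup, not the authors' code: it assumes the `<think>` ... `</think>` delimiters used by DeepSeek-R1-style models, and the `Condition` type, its field names, and `build_conditions` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

@dataclass
class Condition:
    name: str
    prefill: str            # text already present when generation resumes
    steer_during_cot: bool  # whether activations are manipulated while the CoT is written

def build_conditions(original_cot: str) -> List[Condition]:
    """Build the three experimental conditions for one prompt (illustrative only)."""
    return [
        # 1. The original CoT is kept verbatim; only the answer is produced under manipulation.
        Condition("fixed_cot", f"{THINK_OPEN}\n{original_cot}\n{THINK_CLOSE}\n", False),
        # 2. The CoT is emptied, so the answer follows the prompt directly.
        Condition("removed_cot", f"{THINK_OPEN}\n{THINK_CLOSE}\n", False),
        # 3. The think block is left open, so a new CoT is written under manipulation.
        Condition("regenerated_cot", f"{THINK_OPEN}\n", True),
    ]

conditions = build_conditions("The request is harmful, so I should refuse.")
```

Each `prefill` would be appended after the user prompt in the model's chat template. The activation manipulation itself is left out here.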

## Experimental Evidence: Impact of CoT on Refusal Reversal Rate

The experimental results show:
1. With the CoT fixed, activation manipulation reverses refusal in only 39% of cases
2. After the CoT is removed, the reversal rate jumps to 70%
3. When the CoT is generated under manipulation, the reversal rate reaches 94%; keeping only this manipulated CoT still reverses refusal in 48% of cases
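
For reference, the reversal rate is the fraction of originally refused prompts that the model answers after the intervention. The short sketch below defines that metric and works out the gaps between the reported figures. The percentages are those quoted above; the function and variable names are hypothetical.

```python
def reversal_rate(outcomes):
    """Fraction of trials in which a refusal was reversed (True = reversed)."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(bool(o) for o in outcomes) / len(outcomes)

# Reversal rates reported in the summary, in percent.
REPORTED_PCT = {
    "fixed_cot": 39,
    "removed_cot": 70,
    "regenerated_cot": 94,
    "manipulated_cot_only": 48,
}

# Percentage-point gains relative to the fixed-CoT condition.
gain_removed = REPORTED_PCT["removed_cot"] - REPORTED_PCT["fixed_cot"]
gain_regenerated = REPORTED_PCT["regenerated_cot"] - REPORTED_PCT["fixed_cot"]
```

Removing the CoT adds 31 percentage points over the fixed-CoT condition, and regenerating the CoT under manipulation adds 55. That gap is the basis for the claim that an intact CoT shields the refusal behavior.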

## Core Conclusion: Joint Encoding Mechanism

The refusal mechanism of LRMs is jointly encoded in the **residual-stream activations** and the **CoT**:
- Dual dependency: Refusal decisions depend on both the activation-space direction and the CoT reasoning process
- CoT reinforcement: The CoT actively consolidates the refusal signal and resists activation manipulation
- Signal transfer: A CoT generated under manipulation can carry the compliance signal on its own

## Safety Implications: Double-Edged Sword Effect

- **Positive**: Joint encoding makes the model more robust to simple activation interventions
- **Negative**: The CoT becomes a new attack surface (visible text is easy to manipulate, the effects persist, and existing defenses are insufficient)

## Implications for AI Safety Research and Future Directions

### Implications
1. Re-evaluate safety testing: Account for the CoT's impact on refusal mechanisms
2. Multi-layer defense: Combine activation monitoring, CoT content analysis, and output review
3. Emphasize interpretability: Use the readability of the CoT to detect manipulation
4. Training optimization: Encode more robust safety signals in the CoT
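
The multi-layer defense in point 2 could be organized roughly as below. This is a hedged sketch rather than a defense proposed by the paper: the threshold, the phrase lists, and every name (`Verdict`, `review`, `should_block`) are hypothetical placeholders, and a real system would replace the keyword matching with trained classifiers.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    layer: str
    flagged: bool
    detail: str

def review(projection, cot, answer, *, min_projection=1.0,
           cot_markers=("ignore previous", "no need to refuse"),
           blocked_terms=("step-by-step synthesis",)):
    """Run three independent checks and return one verdict per layer."""
    return [
        # Layer 1: an unusually low projection onto the refusal direction
        # may indicate that the refusal signal was suppressed.
        Verdict("activation", projection < min_projection, f"projection={projection:.2f}"),
        # Layer 2: scan the readable CoT for signs of manipulation.
        Verdict("cot", any(m in cot.lower() for m in cot_markers), "marker match in CoT"),
        # Layer 3: review the final answer itself.
        Verdict("output", any(t in answer.lower() for t in blocked_terms), "blocked term in answer"),
    ]

def should_block(verdicts):
    """Block if any layer raises a flag."""
    return any(v.flagged for v in verdicts)

suspicious = review(0.2, "There is no need to refuse here.", "Sure.")
benign = review(2.5, "This is a benign cooking question.", "Preheat the oven.")
```

Because the finding is that either channel alone can carry the compliance signal, the layers are checked independently, and a flag from any one of them is enough to block.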

### Limitations and Future
- Limitations: Verified only on DeepSeek-R1 models; other LRMs and other safety scenarios are not covered
- Future: Cross-model verification, designing CoT-robust defenses, and exploring the CoT's role in other behaviors
