Section 01
How Does Chain-of-Thought Affect the Safety Refusal Mechanism of Large Reasoning Models? Key Findings at a Glance
Research Source
- Original Authors: arXiv authors
- Source Platform: arXiv
- Original Title: Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
- Link: http://arxiv.org/abs/2605.26772v1
- Publication Date: 2026-05-26
Core Points
The study finds that refusal in Large Reasoning Models (LRMs) is encoded jointly in the activation space and in the Chain-of-Thought (CoT), rather than in the activations alone. This joint encoding makes the models more robust to activation manipulation, but it also exposes the CoT as a potential attack surface.
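To make "activation manipulation" concrete, here is a minimal sketch of single-direction steering, a technique commonly used in refusal research. It is an illustration under stated assumptions, not the paper's method: the activations are synthetic NumPy arrays, the difference-of-means estimator is an assumed choice, and the function names (`refusal_direction`, `ablate_direction`, `steer`) are hypothetical. It does not model the CoT at all, which is the part the study says such simple steering fails to account for.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Project out the component of each activation along `direction`."""
    return acts - np.outer(acts @ direction, direction)


def steer(acts, direction, alpha):
    """Shift every activation by `alpha` along `direction`."""
    return acts + alpha * direction


# Synthetic stand-in for hidden states: "harmful" prompts are shifted
# along one axis relative to "harmless" ones (an assumption for the demo).
rng = np.random.default_rng(0)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, hidden))
harmful = rng.normal(size=(200, hidden)) + 4.0 * true_dir

direction = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, direction)   # refusal component removed
steered = steer(harmless, direction, alpha=4.0)  # refusal component added
```

In a real model the arrays would come from a chosen layer's residual stream. The study's point is that editing this one direction is less effective for LRMs, because part of the refusal signal is carried by the CoT itself.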