Section 01
[Overview] Attention Vulnerabilities in Large Reasoning Models and a New Paradigm of Reinforcement Learning-based Jailbreak Attacks
Large Reasoning Models (LRMs) such as OpenAI o1/o3 and DeepSeek-R1 achieve strong reasoning capabilities through chain-of-thought mechanisms, but exposing their reasoning process introduces new security risks: they are more vulnerable to jailbreak attacks than standard LLMs. The study finds that jailbreak success is closely tied to attention distribution: in successful attacks, harmful tokens receive low attention in the input layer and high attention in the reasoning layer. Building on this finding, the proposed attention-guided reinforcement learning attack method significantly outperforms existing methods in success rate, efficiency, and transferability. The findings also point to new directions for LRM security defense.