# Attention Vulnerabilities in Large Reasoning Models: A New Paradigm of Reinforcement Learning-based Jailbreak Attacks

> The study finds that exposing the reasoning process of Large Reasoning Models (LRMs) introduces new security risks; successful jailbreaks are closely related to attention distribution, and the attention-guided reinforcement learning method significantly outperforms existing solutions in attack success rate and transferability.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-19T07:36:52.000Z
- Last activity: 2026-05-20T08:20:43.568Z
- Popularity: 126.3
- Keywords: large reasoning models, jailbreak attacks, attention mechanism, reinforcement learning, AI safety, chain-of-thought, adversarial attacks, model alignment
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-19485v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-19485v1
- Markdown source: floors_fallback

---

## [Overview] Attention Vulnerabilities in Large Reasoning Models and a New Paradigm of Reinforcement Learning-based Jailbreak Attacks

Large Reasoning Models (LRMs) such as OpenAI o1/o3 and DeepSeek-R1 achieve strong reasoning through chain-of-thought mechanisms, but exposing their reasoning process introduces new security risks: according to the study, they are more vulnerable to jailbreak attacks than standard LLMs. The study finds that successful jailbreaks are closely related to attention distribution: harmful tokens receive low attention in the input prompt and high attention in the reasoning content. Building on this finding, the proposed attention-guided reinforcement learning attack reportedly outperforms existing approaches in success rate, efficiency, and transferability, and it also suggests new directions for defending LRMs.

## Background: The Security Paradox of Reasoning Models

Large Reasoning Models (LRMs) outperform traditional LLMs on complex tasks by generating structured, step-by-step reasoning content (chain-of-thought). However, exposing the internal reasoning process also makes LRMs more susceptible to jailbreak attacks, in which a model is induced to generate harmful content.

## Key Finding: Correlation Between Attention Distribution and Jailbreak Success Rate

The study finds that the attention pattern of successful jailbreak attacks has two characteristics:
1. Input attention suppression: harmful tokens have low attention weights in the input prompt;
2. Reasoning attention enhancement: the same harmful tokens have high attention in the reasoning content.

This reveals blind spots in LRM security mechanisms and offers new angles for attack design, defense improvement, and rethinking model architecture.
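To make the two characteristics concrete, the sketch below computes the share of attention that a set of flagged token positions receives in a prompt segment and in a reasoning segment. It is a minimal illustration and not the paper's measurement procedure: the function name `attention_share`, the assumption that attention has already been reduced to one weight per token, and all numbers are hypothetical.

```python
def attention_share(weights, flagged):
    """Fraction of the total attention mass in `weights` that falls on `flagged` positions.

    weights: non-negative attention weights, one per token (assumed to be
             pre-aggregated, e.g. averaged over heads and layers).
    flagged: set of token indices considered sensitive.
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for i, w in enumerate(weights) if i in flagged) / total


# Hypothetical per-token weights for a 5-token prompt and a 4-token reasoning trace.
prompt_weights = [0.30, 0.05, 0.25, 0.05, 0.35]
reasoning_weights = [0.10, 0.40, 0.10, 0.40]

# Positions 1 and 3 stand in for the flagged tokens in both segments.
prompt_share = attention_share(prompt_weights, {1, 3})
reasoning_share = attention_share(reasoning_weights, {1, 3})
```

With these made-up numbers, the flagged positions hold a small share of the attention in the prompt and a large share in the reasoning trace, which is the pattern the study associates with successful jailbreaks.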

## Attack Method: Attention-Guided Reinforcement Learning Framework

Based on the attention findings, the study proposes a novel jailbreak method, whose core is integrating attention signals into the reinforcement learning reward function:
1. Attention-aware reward function: Minimize input attention + Maximize reasoning attention;
2. Diverse persuasion strategy space: Role-playing, scenario construction, logical confusion, progressive induction;
3. Policy optimization and transfer: Learn transferable attack policies with the PPO algorithm; policies trained on open-source models can be transferred to closed-source models.
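One way to write the attention-aware reward from item 1 is as a weighted sum. This is an illustrative form only: the summary does not give the exact formula, and the symbols $R_{\text{task}}$ (a base reward for the attack outcome), $A_{\text{in}}$ (attention on harmful tokens in the input prompt), $A_{\text{reason}}$ (attention on the same tokens in the reasoning content), and the weights $\lambda_1, \lambda_2$ are assumptions introduced here.

```latex
R(p) = R_{\text{task}}(p) \;-\; \lambda_1 \, A_{\text{in}}(p) \;+\; \lambda_2 \, A_{\text{reason}}(p),
\qquad \lambda_1, \lambda_2 \ge 0
```

Under this form, maximizing $R$ pushes $A_{\text{in}}$ down and $A_{\text{reason}}$ up, which matches the "minimize input attention, maximize reasoning attention" description.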

## Experimental Evidence: Evaluation of Attack Performance and Transferability

Experiments covered 3 evaluation benchmarks and 5 models:
- Attack Success Rate (ASR): 15-25% higher than gradient-based methods, 30-40% higher than template-based methods, and 10-15% higher than pure RL methods;
- Efficiency: Fewer queries on average, faster convergence, and controllable computational overhead;
- Transferability: Attacks trained on open-source models remain effective against other open-source models, closed-source models, and models with different architectures, suggesting that LRMs share attention vulnerabilities.
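The summary does not say whether the ASR gaps are percentage points or relative improvements, and the two readings can differ considerably. The sketch below computes ASR from judged outcomes and reports both kinds of gap. The function names and the outcome lists are hypothetical and are not taken from the paper.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful (truthy entries)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(1 for o in outcomes if o) / len(outcomes)


def compare_asr(asr_a, asr_b):
    """Return (percentage-point gap, relative gain in percent) of method A over method B."""
    point_gap = (asr_a - asr_b) * 100
    relative_gain = (asr_a - asr_b) / asr_b * 100 if asr_b > 0 else float("inf")
    return point_gap, relative_gain


# Hypothetical judged outcomes for two methods on the same 10 prompts.
method_a = [True] * 8 + [False] * 2
method_b = [True] * 6 + [False] * 4

asr_a = attack_success_rate(method_a)
asr_b = attack_success_rate(method_b)
point_gap, relative_gain = compare_asr(asr_a, asr_b)
```

In this made-up case the same pair of results is a gap of about 20 percentage points but a relative gain of about 33%, so the reported ranges are best read together with the paper's own definition.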

## Defense Considerations: Protection Strategies Against Attention Vulnerabilities

Existing security mechanisms (such as RLHF) do not fully account for the attack surface exposed by the reasoning process. Potential defense directions:
1. Attention monitoring: Identify harmful tokens that receive abnormally low attention in the input prompt;
2. Reasoning process review: Set up security checkpoints during the reasoning phase;
3. Adversarial training: Incorporate attention-guided attacks to improve robustness;
4. Reasoning process isolation: Separate internal states from user outputs or filter reasoning content.
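The attention-monitoring idea in item 1 could be prototyped roughly as follows. This is a sketch under stated assumptions, not a mechanism described in the source: the function `low_attention_flags`, the `ratio` threshold, the sensitive-term set, and the per-token weights are all hypothetical, and a real monitor would need access to the model's attention weights.

```python
from statistics import mean


def low_attention_flags(tokens, weights, sensitive_terms, ratio=0.5):
    """Return sensitive tokens whose attention weight is below `ratio` times the prompt mean.

    tokens:          prompt tokens as strings.
    weights:         one attention weight per token (assumed pre-aggregated).
    sensitive_terms: lowercase terms a safety policy treats as sensitive.
    """
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    if not weights:
        return []
    threshold = ratio * mean(weights)
    return [
        tok for tok, w in zip(tokens, weights)
        if tok.lower() in sensitive_terms and w < threshold
    ]


# Hypothetical prompt in which a sensitive placeholder term receives little attention.
tokens = ["please", "explain", "TERM_X", "in", "detail"]
weights = [0.30, 0.30, 0.02, 0.08, 0.30]
flagged = low_attention_flags(tokens, weights, {"term_x"})
```

A prompt that triggers a flag could then be routed to a stricter safety check. The mean-relative threshold is only one simple choice; a deployed system would likely need calibration against benign traffic.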

## Ethics and Trends: Responsible Research and Industry Impact

The study follows responsible disclosure principles (the authors communicated with vendors, and the work is for research purposes only) and aims to raise security awareness. Industry trends:
- Trade-off between capability and security: Improved chain-of-thought capabilities come with security costs;
- Attack evolution: From prompt engineering to adaptive reinforcement learning attacks;
- Security gap between open-source and closed-source models: Vulnerabilities found in open-source models readily carry over to closed-source models, highlighting the value of security research in the open-source community.
