Core Findings: How Security Mechanisms Are 'Weakened'
Key Layer Localization: Layers 17-24 Account for 43% of Causal Importance
Activation patching experiments revealed a striking result: 43.1% of the model's total causal importance is concentrated in layers 17 to 24. These layers act as a 'security decision layer', responsible for identifying harmful content and generating the rejection signal.
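The mechanics of such an experiment can be sketched with a toy residual-stream model. This is a minimal illustration, not the study's actual setup: the model, its weights, and the 'rejection logit' readout below are all invented, but the patching logic (cache one run's layer outputs, splice them into another run, measure how much of the output gap is recovered) is the standard technique.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D = 8, 16  # toy depth and hidden size (illustrative, not the real model)

# Fixed random weights for a toy residual-stream model.
Ws = [rng.normal(scale=0.2, size=(D, D)) for _ in range(N_LAYERS)]
readout = rng.normal(size=D)  # maps the final state to a scalar "rejection logit"

def forward(x, patch_layer=None, patch_deltas=None):
    """Run the toy model; optionally replace one layer's output delta
    with a cached delta from another run (activation patching)."""
    h = x.copy()
    deltas = []
    for i, W in enumerate(Ws):
        delta = np.tanh(W @ h)          # this layer's contribution
        if i == patch_layer:
            delta = patch_deltas[i]     # splice in the cached activation
        h = h + delta                   # residual update
        deltas.append(delta)
    return float(readout @ h), deltas

clean = rng.normal(size=D)                          # stands in for the plain harmful prompt
jailbreak = clean + rng.normal(scale=0.5, size=D)   # stands in for the jailbreak variant

clean_logit, clean_deltas = forward(clean)
jb_logit, _ = forward(jailbreak)

# Causal importance of layer i: rerun the jailbreak input but restore the
# clean run's layer-i output, then measure what fraction of the logit gap
# between the two runs is recovered.
effects = []
for i in range(N_LAYERS):
    patched_logit, _ = forward(jailbreak, patch_layer=i, patch_deltas=clean_deltas)
    effects.append((patched_logit - jb_logit) / (clean_logit - jb_logit))
```

In the real experiment the per-layer effects are then normalized, which is how a statement like "layers 17-24 account for 43.1% of causal importance" is obtained.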
Even more surprisingly, the model does not 'make up its mind' until the final five layers (layers 34-39); before that point, the probability gap between rejection and compliance keeps fluctuating. This indicates that the safety decision of a large model is a gradual, multi-stage process, not something completed at a single location.
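This layer-by-layer picture is obtained by reading the hidden state out at every layer and tracking the gap between a rejection readout and a compliance readout (a logit-lens-style probe). The sketch below uses a random toy model, so only the mechanics carry over, not the curve; both readout directions are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
N_LAYERS, D = 40, 32  # 40 toy layers to mirror a 40-layer model (illustrative)

Ws = [rng.normal(scale=0.05, size=(D, D)) for _ in range(N_LAYERS)]
refuse_dir = rng.normal(size=D)   # hypothetical "reject" readout direction
comply_dir = rng.normal(size=D)   # hypothetical "comply" readout direction

h = rng.normal(size=D)  # stands in for a prompt's embedding
gap_per_layer = []
for W in Ws:
    h = h + np.tanh(W @ h)  # residual update
    gap_per_layer.append(float(h @ refuse_dir - h @ comply_dir))

# The decision is "made" where this gap stops fluctuating and settles in sign;
# per the study, that happens only in the last five layers.
```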
Jailbreak Mechanism: Signal Attenuation Rather Than Path Bypassing
The study's core insight overturns the conventional view: jailbreak attacks do not 'bypass' security mechanisms but rather 'weaken' the rejection signal.
Specifically, role-playing tokens in jailbreak prompts (such as 'Imagine', 'protagonist', 'without restrictions') gradually attenuate the strength of the security signal in layers 17-24 rather than injecting a new 'comply' signal. As the security signal weakens, the originally narrow rejection boundary is breached, and the model flips to compliance in the last few layers.
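The attenuation hypothesis can be expressed as an operation on the residual stream: shrink the hidden state's component along a refusal direction, rather than adding an opposing vector. A minimal sketch with synthetic activations; the unit `refusal_dir` and the surviving fraction `keep` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # toy hidden size

# Hypothetical unit "refusal direction" (in practice such a direction is
# estimated, e.g. from activations on harmful vs. harmless prompts).
refusal_dir = rng.normal(size=D)
refusal_dir /= np.linalg.norm(refusal_dir)

# Synthetic hidden state for a plain harmful prompt: a strong component
# along refusal_dir plus noise.
harmful_act = 3.0 * refusal_dir + rng.normal(scale=0.3, size=D)

# Attenuation: shrink the refusal component instead of adding a "comply" vector.
keep = 0.3  # fraction of the refusal component that survives (invented value)
proj = float(harmful_act @ refusal_dir)
jailbreak_act = harmful_act - (1.0 - keep) * proj * refusal_dir

def refusal_strength(h):
    """Signal strength = projection onto the refusal direction."""
    return float(h @ refusal_dir)
```

Under this picture the jailbroken state still carries a positive but much weaker refusal signal, which is why the narrow rejection boundary can be breached downstream.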
The data show that the change in compliance (Δ) before and after jailbreak ranges from +0.94 to +5.06, and the Pearson correlation between representational divergence and causal effect reaches r = 0.95, indicating a strong link between representation differences and safety failure.
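The r = 0.95 figure is a plain Pearson correlation between two per-layer measurement vectors: representational divergence and causal (patching) effect. With both vectors in hand, the computation is one line; the values below are made up for illustration and are not the paper's data.

```python
import numpy as np

# Illustrative per-layer measurements (NOT the study's actual numbers):
divergence    = np.array([0.10, 0.20, 0.80, 1.10, 1.30, 0.90, 0.40, 0.20])
causal_effect = np.array([0.05, 0.15, 0.70, 1.00, 1.20, 0.80, 0.50, 0.10])

r = np.corrcoef(divergence, causal_effect)[0, 1]  # Pearson correlation coefficient
```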
Token-level Attribution: Which Words Drive Compliance Reversal?
Integrated gradient analysis provides fine-grained, token-level explanations. In clean prompts (which are rejected), dangerous words such as 'SQL injection' produce strong rejection signals (blue); in jailbreak prompts, role-playing framing words ('Imagine you are...', 'protagonist') produce compliance signals (red) that override the danger signals.
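Integrated gradients attribute a model's output to its inputs by averaging the gradient along a straight path from a baseline to the input and scaling by the input difference. The sketch below applies the method to an invented differentiable "compliance score" over toy token embeddings, just to show the computation; the key sanity check is the completeness property (attributions sum to the output difference).

```python
import numpy as np

rng = np.random.default_rng(2)
T, D = 5, 8  # tokens and embedding dim (toy scale)

W = rng.normal(size=(T, D))  # weights of a toy scalar "compliance" head (invented)

def f(X):
    """Toy differentiable compliance score for token embeddings X."""
    return np.tanh((X * W).sum())

def grad_f(X):
    """Analytic gradient of f with respect to X."""
    s = (X * W).sum()
    return (1.0 - np.tanh(s) ** 2) * W

X = rng.normal(size=(T, D))   # token embeddings of the prompt
baseline = np.zeros_like(X)   # zero-embedding baseline
steps = 100

# Integrated gradients: average the gradient along the straight path from
# baseline to input (midpoint rule), then scale by (input - baseline).
alphas = (np.arange(steps) + 0.5) / steps
avg_grad = np.mean([grad_f(baseline + a * (X - baseline)) for a in alphas], axis=0)
token_attributions = ((X - baseline) * avg_grad).sum(axis=1)  # one score per token

# Completeness: attributions sum (approximately) to f(X) - f(baseline).
```

In the study's setting, a positive attribution on a role-playing token marks it as pushing toward compliance (red), while a negative one on a dangerous token marks it as pushing toward rejection (blue).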
This attribution visualization not only explains why a given prompt succeeds in jailbreaking, it also pinpoints concrete targets for designing defenses.
Single-layer Patching Effect: Local Intervention Can Restore Security
An exciting finding: activation patching on a single key layer often matches or even exceeds the effect of a complete jailbreak. This means the failure of the security mechanism is local, and more efficient protection may eventually be achieved through targeted layer-level defenses (for example, strengthening the rejection-signal threshold in layers 17-24).
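If the failure is local, a defense can be local too. The sketch below shows the shape of such an intervention, assuming the attenuation picture above: at a security layer, measure the hidden state's component along a refusal direction and restore it when it has been pushed below a target threshold. The direction, threshold, and activations are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 64  # toy hidden size

refusal_dir = rng.normal(size=D)            # hypothetical unit refusal direction
refusal_dir /= np.linalg.norm(refusal_dir)

def strengthen_refusal(h, direction, target_strength):
    """Layer-level defense sketch: if the refusal component of hidden
    state h falls below target_strength, add back the missing amount."""
    proj = float(h @ direction)
    if proj < target_strength:
        h = h + (target_strength - proj) * direction
    return h

# A hidden state whose refusal component was attenuated by a jailbreak:
weakened = 0.5 * refusal_dir + rng.normal(scale=0.2, size=D)
restored = strengthen_refusal(weakened, refusal_dir, target_strength=2.0)
```

In a real model this would run as a forward hook on the relevant layers; whether a fixed threshold is robust across prompts is exactly the kind of question layer-level defense research would need to answer.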