Section 01
[Introduction] A New Scheme for Defending Against LLM Jailbreak Attacks: Analyzing the Causal Monitoring of Hidden States
This article introduces an LLM security scheme: a defense against jailbreak attacks based on causal monitoring of hidden states. Traditional keyword filtering and output review struggle to keep up with evolving attacks. This scheme instead monitors causal features in the model's internal hidden states to detect and block jailbreak attacks precisely, offering a new perspective for AI security that shifts from external behavior monitoring to internal state analysis.