Section 01
[Introduction] Latent Space Escape Attacks Reveal Vulnerability of Safety Alignment in Large Models
This study reframes the refusal suppression problem from the perspective of latent space escape attacks and proposes a controlled latent space escape attack method. The method achieves state-of-the-art attack success rates on 15 mainstream large language models (including instruction-tuned, multimodal, and reasoning models). These results reveal fundamental limitations of existing safety alignment mechanisms and pose severe challenges to the safe deployment of large models.