Section 01
Introduction: Theoretical Analysis of Continuous Adversarial Training and Jailbreak Defense Mechanisms for LLMs
This paper presents the first theoretical analysis of Continuous Adversarial Training (CAT) from the perspective of in-context learning. It proves that the robust generalization bound of linear Transformers is negatively correlated with the perturbation radius in the embedding space, explains why CAT can defend against jailbreak prompts in the token space, and proposes an improved regularization method based on the singular values of the embedding matrix.

Paper link: http://arxiv.org/abs/2604.12817v1
Code repository: https://github.com/fshp971/continuous-adv-icl
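To make the two ingredients above concrete, here is a minimal sketch, not the paper's actual method: a continuous (embedding-space) adversarial perturbation projected into an L2 ball of radius eps, plus a hypothetical regularizer that penalizes the largest singular value of the embedding matrix. The function names, step sizes, and the exact form of the penalty are illustrative assumptions; see the linked repository for the authors' implementation.

```python
import numpy as np

def spectral_penalty(E):
    """Illustrative singular-value regularizer (assumption, not the
    paper's exact formula): penalize the largest singular value of
    the embedding matrix E to control perturbation amplification."""
    return np.linalg.svd(E, compute_uv=False)[0]

def embedding_pgd(x, grad_fn, eps, alpha=0.1, steps=5):
    """Continuous adversarial perturbation in embedding space:
    repeatedly ascend the loss gradient, then project the
    perturbation back into the L2 ball of radius eps around x."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x + delta)                      # loss gradient at perturbed point
        delta += alpha * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent step
        norm = np.linalg.norm(delta)
        if norm > eps:                              # project onto the eps-ball
            delta *= eps / norm
    return x + delta

# Toy usage: maximize 0.5 * ||z - t||^2, whose gradient is (z - t).
t = np.ones(3)
x_adv = embedding_pgd(np.zeros(3), lambda z: z - t, eps=0.5)
```

The projection step is what ties this to the theory summarized above: the perturbation radius eps is the quantity the robust generalization bound is stated in terms of.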