Section 01
ProjectTextAttack: Guide to the Study on Robustness Evaluation of Large Language Models Against Jailbreak Attacks
Built on the TextAttack framework, this study evaluates the safety of three mainstream open-source large language models (LLaMA3.3, GPT-OSS, and Qwen3) against 11 jailbreak attack techniques. The core question is whether current model safety-alignment mechanisms can withstand structured jailbreak attacks. The results show that GPT-OSS is the most resistant (attack success rate of only 5%), while LLaMA3.3 is the most vulnerable (attack success rate of 70%), revealing substantial differences in the robustness of safety alignment across mainstream models.
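As a concrete illustration of the metric cited above, the following is a minimal sketch of how a per-model attack success rate (ASR) could be tallied from jailbreak-attempt outcomes. The function name, data shape, and the 20-attempt example counts are illustrative assumptions, not the study's actual evaluation harness.

```python
from collections import defaultdict

def attack_success_rate(results):
    """Compute ASR per model from (model_name, succeeded) pairs.

    results: iterable of (model_name, succeeded) tuples, one per
    jailbreak attempt; succeeded is True when the model produced a
    harmful completion instead of refusing.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for model, succeeded in results:
        attempts[model] += 1
        if succeeded:
            successes[model] += 1
    # ASR = successful attacks / total attempts, per model
    return {m: successes[m] / attempts[m] for m in attempts}

# Hypothetical example: 20 attempts per model, with success counts
# chosen to match the rates reported in this guide (5% and 70%)
outcomes = (
    [("GPT-OSS", i < 1) for i in range(20)]      # 1/20 succeed
    + [("LLaMA3.3", i < 14) for i in range(20)]  # 14/20 succeed
)
print(attack_success_rate(outcomes))
```

In practice the success flag would come from a refusal/harmfulness classifier applied to each model response; this sketch only shows the aggregation step.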