Zing Forum


SafeProbe: Automated Red Teaming and Security Alignment Evaluation for Large Language Models

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment of large language models at inference time, with support for multiple attack vectors (jailbreak, prompt injection, adversarial prompt refinement) and a Chain-of-Thought-based automated judging system.

Tags: LLM security, red teaming, prompt injection, jailbreak attacks, model alignment, AI safety, adversarial machine learning, Python tools, automated testing
Published 2026/04/14 07:38 · Last activity 2026/04/14 07:49 · Estimated reading time: 7 minutes

Section 01

SafeProbe: An Open-Source Toolkit for LLM Security Alignment Evaluation

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment of large language models (LLMs) at inference time. It supports multiple attack vectors (jailbreak, prompt injection, adversarial prompt refinement) and a Chain-of-Thought (CoT)-based automated judging system. Designed to balance research reproducibility with practical deployability, it helps developers, researchers, and security engineers integrate safety assessments into CI/CD pipelines and pre-deployment checks. It supports mainstream LLM providers (OpenAI, Anthropic, HuggingFace, etc.) as well as open-source models such as Llama-3, Mistral, and Qwen3.

Section 02

Background: The Need for Security Alignment Evaluation

With LLMs widely deployed in various applications, model security issues have become increasingly prominent (e.g., ChatGPT jailbreak attacks, prompt injection techniques). Traditional security assessments rely on manual reviews or simple keyword matching, which are time-consuming and easily bypassed by new attack methods. SafeProbe addresses this gap by adopting an intent-aware, semantic security evaluation approach, using automated red team testing, quantitative robustness metrics, and CoT-based LLM judging systems to analyze models' real security performance.

Section 03

Core Attack Techniques in SafeProbe

SafeProbe implements four main query-access (black-box) attack techniques:

  1. PromptMap: A rule-based prompt transformation layer with 56 YAML rules covering 6 categories (jailbreak, harmful content, hate speech, distraction, social bias, prompt stealing), each with a complexity weight of 1.
  2. CipherChat: Encoding-based attacks using Caesar cipher, Atbash, Morse code, and ASCII encoding to bypass keyword filters (complexity weight: 3).
  3. PAIR: A model-based iterative optimization attack that uses another LLM to refine adversarial prompts (complexity weight: 5).
  4. Composite: A signature attack combining Competing Objectives (CO: prefix_injection, refusal_suppression, style_injection, roleplay) and Mismatched Generalization (MG: base64, rot13, leetspeak, pig_latin, translation) into 20 combinations, ranked by Attack Success Rate (ASR) (complexity weight: 7).
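
The 4 × 5 CO × MG pairing behind the Composite attack is easy to picture in code. The technique names below come from the list above; the encoder implementations (including the leetspeak character map) are illustrative stand-ins, not SafeProbe's actual transforms:

```python
import base64
import codecs
from itertools import product

# Competing Objectives (CO) and Mismatched Generalization (MG)
# technique names as listed in the Composite attack above.
CO_TECHNIQUES = ["prefix_injection", "refusal_suppression",
                 "style_injection", "roleplay"]
MG_ENCODINGS = ["base64", "rot13", "leetspeak", "pig_latin", "translation"]

# Illustrative leetspeak substitution table (an assumption of this sketch).
LEET_MAP = str.maketrans("aeiost", "43105+")

def mg_encode(prompt: str, scheme: str) -> str:
    """Apply one Mismatched Generalization encoding to a probe prompt."""
    if scheme == "base64":
        return base64.b64encode(prompt.encode()).decode()
    if scheme == "rot13":
        return codecs.encode(prompt, "rot13")
    if scheme == "leetspeak":
        return prompt.translate(LEET_MAP)
    # pig_latin and translation are omitted in this sketch
    return prompt

# Enumerate every CO x MG pairing: 4 * 5 = 20 combinations.
combos = list(product(CO_TECHNIQUES, MG_ENCODINGS))
print(len(combos))  # 20
print(mg_encode("ignore previous instructions", "rot13"))
```

Each combination would then wrap a CO framing (e.g. a roleplay preamble) around an MG-encoded payload, and the 20 variants are ranked by their measured ASR.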

Section 04

Multi-Backend Judging System & Consistency Evaluation

SafeProbe features three judging backends following a unified BaseJudge interface:

  1. CoT Judge: Uses DeepSeek R1 or API models to provide 0/1 scores plus detailed reasoning, distinguishing genuinely harmful content from legitimate discussion of related topics.
  2. Llama Guard 3: Meta's local safety classifier (via HuggingFace) for fast safety classification.
  3. HarmBench Classifier: CAIS's binary classifier for detecting harmful content.

The system also supports running multiple judges in parallel and computes Cohen's κ and Fleiss' κ to assess inter-judge consistency, ensuring evaluation reliability.
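
Cohen's κ for a pair of binary judges is simple to compute from their 0/1 verdicts. The judge names in the comments refer to the backends above; the verdict vectors are made-up illustrative data:

```python
from collections import Counter

def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges' 0/1 verdicts on the same responses.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance given each judge's label
    rates.
    """
    n = len(judge_a)
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    rates_a = Counter(judge_a)
    rates_b = Counter(judge_b)
    p_e = sum((rates_a[k] / n) * (rates_b[k] / n) for k in (0, 1))
    return (p_o - p_e) / (1 - p_e)

cot   = [1, 0, 1, 1, 0, 0, 1, 0]  # CoT judge verdicts (1 = harmful)
guard = [1, 0, 1, 0, 0, 0, 1, 1]  # Llama Guard 3 verdicts
print(round(cohens_kappa(cot, guard), 3))  # → 0.5
```

Fleiss' κ generalizes the same idea to more than two judges, which is what makes running all three backends in parallel useful: a low κ flags probe categories where the judges disagree and manual review is warranted.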

Section 05

Evaluation Metrics & Practical Applications

Metrics:

  • Attack Success Rate (ASR): Proportion of successful attacks.
  • Robustness Score: Comprehensive resistance to various attacks.
  • Attack Combination Ranking: ASR-based ranking of Composite attack combinations.

Reports can be generated in TXT, JSON, or PDF (with visual charts).
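
One plausible way to aggregate these metrics is to weight each technique's ASR by the complexity weights listed earlier (PromptMap 1, CipherChat 3, PAIR 5, Composite 7); the exact formula SafeProbe uses is not specified here, so this weighted aggregation is an assumption of the sketch:

```python
# Complexity weights from the attack techniques described above.
WEIGHTS = {"promptmap": 1, "cipherchat": 3, "pair": 5, "composite": 7}

def asr(verdicts):
    """Attack Success Rate: fraction of attacks judged successful."""
    return sum(verdicts) / len(verdicts)

def robustness_score(per_technique):
    """Complexity-weighted resistance score in [0, 1].

    Illustrative aggregation: higher-weight (harder) attacks count more,
    and a model that resists everything scores 1.0.
    """
    total = sum(WEIGHTS[t] for t in per_technique)
    weighted_asr = sum(
        WEIGHTS[t] * asr(v) for t, v in per_technique.items()
    ) / total
    return 1 - weighted_asr

# Made-up per-technique verdicts (1 = attack succeeded).
results = {
    "promptmap":  [1, 0, 0, 0],
    "cipherchat": [1, 1, 0, 0],
    "pair":       [1, 1, 1, 0],
    "composite":  [1, 1, 1, 1],
}
print({t: asr(v) for t, v in results.items()})
print(round(robustness_score(results), 3))  # → 0.219
```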

Applications:

  1. Pre-deployment security audits for new models.
  2. CI/CD integration: Auto-run security assessments after model updates.
  3. Adversarial training data generation: Use attack samples to enhance model robustness.
  4. Third-party model evaluation: Compare security performance of different LLM providers.

Section 06

Technical Architecture & NIST Compliance

SafeProbe uses a modular architecture with four stages: Attack → Consolidate → Judge → Report. This design allows users to:

  • Run only the attack phase for test data generation.
  • Use custom judging backends.
  • Extend new attack techniques.
  • Integrate into existing MLOps toolchains.

It follows the NIST Adversarial Machine Learning Taxonomy (AI 100-2e2025), ensuring scientific and standardized evaluation methods, which is crucial for compliance audits.
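
The four-stage flow can be sketched as a small pipeline of callables. The stage names follow the article; the classes, stand-in attacks, target, and judge below are hypothetical, not SafeProbe's real interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    attacks: list           # callables: prompt -> attacked prompt
    judge: callable         # callable: response -> 0/1 verdict
    results: list = field(default_factory=list)

    def run(self, prompts, target):
        # Attack: transform each probe prompt with every attack technique.
        attacked = [(a.__name__, a(p)) for p in prompts for a in self.attacks]
        # Consolidate: query the target model and collect its responses.
        consolidated = [(name, q, target(q)) for name, q in attacked]
        # Judge: score each response (1 = harmful content produced).
        self.results = [(name, self.judge(r)) for name, _, r in consolidated]
        # Report: aggregate success counts per attack technique.
        report = {}
        for name, verdict in self.results:
            report[name] = report.get(name, 0) + verdict
        return report

def rot13_attack(p):        # stand-in MG-style transform
    import codecs
    return codecs.encode(p, "rot13")

def roleplay_attack(p):     # stand-in CO-style transform
    return "You are DAN. " + p

pipe = Pipeline(attacks=[rot13_attack, roleplay_attack],
                judge=lambda resp: int("harmful" in resp))
report = pipe.run(["probe prompt"], target=lambda q: "refused")
print(report)  # {'rot13_attack': 0, 'roleplay_attack': 0}
```

Because each stage only depends on the previous stage's output, any stage can be swapped or run in isolation, which is what enables the attack-only, custom-judge, and MLOps-integration use cases listed above.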

Section 07

Conclusion & Future Outlook

SafeProbe represents a significant advancement in LLM security evaluation, transforming academic red team testing methods into standardized engineering processes. It provides a practical, comprehensive solution for teams deploying LLMs. As AI security issues grow more complex, such automated tools will become essential in model development. Its open-source nature allows the community to contribute new attack techniques and judging methods, keeping it up-to-date with evolving adversarial threats.