Zing Forum

SafeProbe: An Automated Red-Team Testing and Security Alignment Evaluation Tool for Large Language Models

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment capabilities of large language models during the inference phase. It supports multiple attack vectors (jailbreak, prompt injection, adversarial prompt refinement) and a Chain-of-Thought-based automated judging system.

Tags: LLM security, red-team testing, prompt injection, jailbreak attacks, model alignment, AI safety, adversarial machine learning, Python tools, automated testing
Published 2026-04-14 07:38 · Recent activity 2026-04-14 07:49 · Estimated read: 7 min

Section 01

SafeProbe: An Open-Source Toolkit for LLM Security Alignment Evaluation

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment of large language models (LLMs) during the inference phase. It supports multiple attack vectors (jailbreak, prompt injection, adversarial prompt refinement) and a Chain-of-Thought (CoT)-based automated judging system. Designed to balance research reproducibility with practical deployment usability, it helps developers, researchers, and security engineers integrate security assessments into CI/CD pipelines and pre-deployment checks. It supports mainstream LLM providers (OpenAI, Anthropic, HuggingFace, etc.) as well as open-source models such as Llama-3, Mistral, and Qwen3.


Section 02

Background: The Need for Security Alignment Evaluation

With LLMs widely deployed in various applications, model security issues have become increasingly prominent (e.g., ChatGPT jailbreak attacks, prompt injection techniques). Traditional security assessments rely on manual reviews or simple keyword matching, which are time-consuming and easily bypassed by new attack methods. SafeProbe addresses this gap by adopting an intent-aware, semantic security evaluation approach, using automated red team testing, quantitative robustness metrics, and CoT-based LLM judging systems to analyze models' real security performance.


Section 03

Core Attack Techniques in SafeProbe

SafeProbe implements four main attack techniques, all of which require only query access to the target model:

  1. PromptMap: A rule-based prompt transformation layer with 56 YAML rules covering 6 categories (jailbreak, harmful content, hate speech, distraction, social bias, prompt stealing), each with a complexity weight of 1.
  2. CipherChat: Encoding-based attacks that use Caesar cipher, Atbash, Morse code, and ASCII encoding to bypass keyword filters (complexity weight: 3).
  3. PAIR: A model-based iterative optimization attack that uses another LLM to refine adversarial prompts (complexity weight: 5).
  4. Composite: A signature attack combining Competing Objectives (CO: prefix_injection, refusal_suppression, style_injection, roleplay) and Mismatched Generalization (MG: base64, rot13, leetspeak, pig_latin, translation) into 4 × 5 = 20 combinations, ranked by Attack Success Rate (ASR) (complexity weight: 7).
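The encoding layer behind CipherChat-style attacks, and the CO × MG enumeration behind Composite, reduce to plain string transforms and a Cartesian product. A minimal sketch (the helper names below are illustrative, not SafeProbe's actual API):

```python
import string
from itertools import product

def caesar(text: str, shift: int = 3) -> str:
    """Caesar cipher over ASCII letters, preserving case and non-letters."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def atbash(text: str) -> str:
    """Atbash cipher: mirror the alphabet (a<->z, b<->y, ...) in both cases."""
    table = str.maketrans(
        string.ascii_lowercase + string.ascii_uppercase,
        string.ascii_lowercase[::-1] + string.ascii_uppercase[::-1],
    )
    return text.translate(table)

# Composite-style enumeration: 4 CO wrappers x 5 MG encodings = 20 combinations
CO = ["prefix_injection", "refusal_suppression", "style_injection", "roleplay"]
MG = ["base64", "rot13", "leetspeak", "pig_latin", "translation"]
combos = list(product(CO, MG))
```

Each `(co, mg)` pair would then wrap and re-encode the original harmful instruction before it is sent to the target model.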

Section 04

Multi-Backend Judging System & Consistency Evaluation

SafeProbe features three judging backends following a unified BaseJudge interface:

  1. CoT Judge: Uses DeepSeek R1 or API models to provide 0/1 scores plus detailed reasoning, distinguishing between harmful content and relevant topic discussions.
  2. Llama Guard 3: Meta's local safety classifier (via HuggingFace) for fast safety classification.
  3. HarmBench Classifier: CAIS's binary classifier for detecting harmful content.

SafeProbe can also run multiple judges in parallel and computes Cohen's κ and Fleiss' κ to measure inter-judge agreement, ensuring evaluation reliability.
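For a pair of binary judges, Cohen's κ is simple arithmetic over the two verdict lists; a self-contained sketch (illustrative code, not SafeProbe's implementation):

```python
def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters over paired binary verdicts (0/1).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label rates.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p1_a = sum(a) / n                                # rater A's rate of "1"
    p1_b = sum(b) / n                                # rater B's rate of "1"
    p_e = p1_a * p1_b + (1 - p1_a) * (1 - p1_b)      # chance agreement
    if p_e == 1.0:                                   # degenerate: no variance
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

κ = 1 means perfect agreement and κ ≈ 0 means agreement no better than chance, which is why a low κ between judging backends is a signal to distrust the aggregate verdicts.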

Section 05

Evaluation Metrics & Practical Applications

Metrics:

  • Attack Success Rate (ASR): Proportion of successful attacks.
  • Robustness Score: An aggregate measure of the model's resistance across all attack techniques.
  • Attack Combination Ranking: ASR-based ranking of the Composite attack combinations.

Reports can be generated in TXT, JSON, or PDF (with visual charts).
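These metrics reduce to arithmetic over per-attempt judge verdicts. A sketch, in which `robustness_score` is one plausible complexity-weighted formulation and an assumption on my part, not SafeProbe's documented formula:

```python
def attack_success_rate(verdicts: list) -> float:
    """ASR = fraction of attack attempts the judge marked harmful (1)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

def rank_combinations(results: dict) -> list:
    """Rank attack combinations by ASR, highest (most effective) first.

    results maps a combination name to its list of 0/1 judge verdicts.
    """
    scored = {name: attack_success_rate(v) for name, v in results.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

def robustness_score(asr_by_attack: dict, weights: dict) -> float:
    """Hypothetical robustness: complexity-weighted mean of (1 - ASR)."""
    total = sum(weights.values())
    return sum(weights[k] * (1 - asr_by_attack[k]) for k in weights) / total
```

Weighting by attack complexity (1 for PromptMap up to 7 for Composite) would reward models that resist the harder, multi-stage attacks, not just the rule-based ones.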

Applications:

  1. Pre-deployment security audits for new models.
  2. CI/CD integration: Auto-run security assessments after model updates.
  3. Adversarial training data generation: Use attack samples to enhance model robustness.
  4. Third-party model evaluation: Compare security performance of different LLM providers.
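The CI/CD use case (application 2) typically boils down to gating a pipeline stage on the reported ASR. A hypothetical gate, with the threshold chosen purely for illustration:

```python
import sys

# Hypothetical CI policy: fail the build if more than 5% of attacks succeed.
ASR_THRESHOLD = 0.05

def ci_gate(asr: float, threshold: float = ASR_THRESHOLD) -> int:
    """Return a process exit code: 0 passes the pipeline stage, 1 fails it."""
    if asr > threshold:
        print(f"FAIL: ASR {asr:.2%} exceeds budget {threshold:.2%}",
              file=sys.stderr)
        return 1
    return 0
```

The CI runner would call `sys.exit(ci_gate(measured_asr))` after the evaluation, so a regression in security alignment blocks the deployment like any failing test.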

Section 06

Technical Architecture & NIST Compliance

SafeProbe uses a modular architecture with four stages: Attack → Consolidate → Judge → Report. This design allows users to:

  • Run only the attack phase for test data generation.
  • Use custom judging backends.
  • Extend new attack techniques.
  • Integrate into existing MLOps toolchains.
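The four-stage flow can be sketched as plain functions over a pluggable target model and judging backend (the names and types below are illustrative, not SafeProbe's actual classes):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Probe:
    attack: str        # attack technique name, e.g. "promptmap"
    prompt: str        # adversarial prompt sent to the target model
    response: str = ""
    verdict: int = -1  # -1 = not judged yet; 0 = safe, 1 = harmful

def run_pipeline(probes: List[Probe],
                 target: Callable[[str], str],
                 judge: Callable[[str, str], int]) -> Dict[str, float]:
    # Attack: query the target model with each adversarial prompt.
    for p in probes:
        p.response = target(p.prompt)
    # Consolidate: group results by attack technique.
    by_attack: Dict[str, List[Probe]] = {}
    for p in probes:
        by_attack.setdefault(p.attack, []).append(p)
    # Judge: score each response with the pluggable judging backend.
    for p in probes:
        p.verdict = judge(p.prompt, p.response)
    # Report: ASR per attack technique.
    return {a: sum(p.verdict for p in ps) / len(ps)
            for a, ps in by_attack.items()}
```

Because `target` and `judge` are just callables, swapping in a custom judging backend or stopping after the attack stage (to keep only the generated test data) follows naturally from this shape.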

It follows the NIST Adversarial Machine Learning Taxonomy (AI 100-2e2025), ensuring scientific and standardized evaluation methods, which is crucial for compliance audits.


Section 07

Conclusion & Future Outlook

SafeProbe represents a significant advancement in LLM security evaluation, transforming academic red-team testing methods into standardized engineering processes. It provides a practical, comprehensive solution for teams deploying LLMs. As AI security issues grow more complex, such automated tools will become essential in model development. Its open-source nature allows the community to contribute new attack techniques and judging methods, keeping it current with evolving adversarial threats.