Zing Forum

Reading

SafeProbe: An Open-Source Security Alignment Evaluation Toolkit for Large Language Models

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment capabilities of large language models (LLMs) during the inference phase. It supports automated red team attacks, multi-dimensional robustness metrics, and chain-of-thought-based semantic security judgment.

大语言模型安全对齐红队攻击对抗性机器学习提示注入越狱攻击AI安全Python工具包
Published 2026-04-29 21:13Recent activity 2026-04-29 21:23Estimated read 6 min
SafeProbe: An Open-Source Security Alignment Evaluation Toolkit for Large Language Models
1

Section 01

SafeProbe: Open-Source Security Alignment Evaluation Toolkit for LLMs

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment capabilities of large language models (LLMs) during the inference phase. It supports automated red team attacks, multi-dimensional robustness metrics, and chain-of-thought-based semantic security judgment. Its design targets both academic research (with reproducibility) and engineering integration (CI/CD pipelines), addressing the gap in deep security evaluation beyond surface-level keyword filtering.

2

Section 02

Background: Real-World Challenges in LLM Security Evaluation

As LLMs are widely applied in production environments, security issues have become prominent. Traditional security assessments often rely on keyword filtering, which struggles to handle complex threats like jailbreaking and prompt injection attacks. Developers and researchers urgently need tools for in-depth security alignment evaluation during inference, not just surface moderation. SafeProbe was born as an open-source solution to this problem, using intent-aware semantic evaluation combined with automated red teaming and quantitative robustness metrics.

3

Section 03

Core Mechanisms: Four Key Attack Techniques

SafeProbe implements four attack techniques for multi-level testing:

  1. PromptMap: Rule-driven prompt transformation via YAML configs, with 56 built-in rules covering jailbreak, harmful content, hate speech, distraction, social bias, and prompt stealing.
  2. CipherChat: Encoding bypass attacks using Caesar, Atbash, Morse, and ASCII encoding to test if models recognize encoded malicious intent.
  3. PAIR: Iterative adversarial optimization using an attacker LLM to refine prompts until breaking the target model's defenses.
  4. Composite: CO × MG combination attacks (Competing Objectives + Mismatched Generalization) testing all combinations of techniques like prefix injection, refusal suppression, base64 encoding, leetspeak, etc., sorted by attack success rate (ASR).
4

Section 04

Evaluation System: Multi-Backend Semantic Security Judgment

SafeProbe uses three semantic-level judgment backends instead of keyword matching:

  • Chain-of-Thought-based DeepSeek R1 judge for deep intent reasoning.
  • Meta's Llama Guard 3 as a specialized security classifier for standardized scoring.
  • CAIS's HarmBench binary classifier for independent perspective. It supports parallel running of multiple judges and calculates consistency metrics like Cohen's κ and Fleiss' κ to ensure reliable results.
5

Section 05

Pipeline Architecture: Four-Stage Workflow

SafeProbe's modular pipeline includes four stages:

  1. Attack: Execute selected attack techniques to generate adversarial prompts and collect model responses.
  2. Consolidate: Aggregate scattered JSON outputs into structured data for subsequent judgment.
  3. Judge: Use configured backends to evaluate responses for security (binary or fine-grained).
  4. Report: Generate comprehensive reports (TXT/JSON/PDF) with ASR, robustness scores, and visualizations.
6

Section 06

Practical Applications: Research & Engineering Integration

For researchers: SafeProbe integrates with mainstream benchmarks (AdvBench, HarmBench, JailbreakBench) for fair comparison, and YAML/JSON configs ensure full reproducibility. For engineers: It provides CLI and Python API, supports multiple LLM providers (OpenAI, Anthropic, HuggingFace, Ollama, xAI), and can be integrated into CI/CD pipelines as a pre-deployment security check.

7

Section 07

Compliance & Conclusion: Towards Proactive AI Security

SafeProbe aligns strictly with the NIST Adversarial Machine Learning Taxonomy (AI 100-2e2025), ensuring alignment with industry best practices for compliance. It represents a shift from post-hoc detection to proactive prevention, allowing developers to find vulnerabilities before deployment. Its open-source nature enables community collaboration to enhance AI security collectively.