Zing Forum

SafeProbe: An Open-Source Security Alignment Evaluation Toolkit for Large Language Models

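SafeProbe is an open-source Python toolkit for evaluating the security alignment of large language models at inference time. It supports automated red-team attacks, multi-dimensional robustness metrics, and chain-of-thought-based semantic security judgment.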

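Tags: large language models, security alignment, red-team attacks, adversarial machine learning, prompt injection, jailbreak attacks, AI security, Python toolkit
Published 2026/04/29 21:13 | Last activity 2026/04/29 21:23 | Estimated reading time: 6 minutes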

Section 01

SafeProbe: Open-Source Security Alignment Evaluation Toolkit for LLMs

SafeProbe is an open-source Python toolkit for evaluating the security alignment of large language models (LLMs) at inference time. It supports automated red-team attacks, multi-dimensional robustness metrics, and chain-of-thought-based semantic security judgment. It is designed for both academic research, where reproducibility matters, and engineering integration into CI/CD pipelines, and it aims to fill the gap left by evaluations that stop at surface-level keyword filtering.

Section 02

Background: Real-World Challenges in LLM Security Evaluation

As LLMs move into production, their security weaknesses have become harder to ignore. Traditional assessments often rely on keyword filtering, which struggles with complex threats such as jailbreaking and prompt injection. Developers and researchers need tools that evaluate security alignment in depth at inference time rather than only moderating surface content. SafeProbe is an open-source response to this need: it combines intent-aware semantic evaluation with automated red teaming and quantitative robustness metrics.

Section 03

Core Mechanisms: Four Key Attack Techniques

SafeProbe implements four attack techniques for multi-level testing:

  1. PromptMap: Rule-driven prompt transformation via YAML configs, with 56 built-in rules covering jailbreak, harmful content, hate speech, distraction, social bias, and prompt stealing.
  2. CipherChat: Encoding bypass attacks using Caesar, Atbash, Morse, and ASCII encoding to test if models recognize encoded malicious intent.
  3. PAIR: Iterative adversarial optimization in which an attacker LLM refines prompts until they break the target model's defenses.
  4. Composite: CO × MG combination attacks (Competing Objectives + Mismatched Generalization) that test every combination of techniques such as prefix injection, refusal suppression, base64 encoding, and leetspeak, with the combinations ranked by attack success rate (ASR).
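
To make the CipherChat idea concrete, here is a minimal sketch of the four encodings named above, written from their textbook definitions. The function names, the default Caesar shift of 3, and the Morse separators are assumptions made for illustration; none of them is taken from SafeProbe's code.

```python
# Illustrative encoders only; names and defaults are hypothetical, not SafeProbe's API.

MORSE = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- "
     "-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..").split(),
))


def _map_letters(text, index_fn):
    """Apply index_fn to the 0-25 index of each ASCII letter; keep other characters."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr(base + index_fn(ord(ch) - base)))
        else:
            out.append(ch)
    return "".join(out)


def caesar(text, shift=3):
    """Shift every letter forward by `shift` positions, wrapping around the alphabet."""
    return _map_letters(text, lambda i: (i + shift) % 26)


def atbash(text):
    """Mirror the alphabet: a <-> z, b <-> y, and so on."""
    return _map_letters(text, lambda i: 25 - i)


def morse(text):
    """Encode letters as Morse; letters are space-separated, words joined by ' / '.

    Characters without an entry in MORSE (digits, punctuation) are dropped.
    """
    words = text.lower().split()
    return " / ".join(" ".join(MORSE[c] for c in w if c in MORSE) for w in words)


def ascii_codes(text):
    """Replace each character with its decimal code point, space-separated."""
    return " ".join(str(ord(c)) for c in text)


sample = "hello world"  # benign placeholder text
for encode in (caesar, atbash, morse, ascii_codes):
    print(f"{encode.__name__:12s} {encode(sample)}")
```

In a real run, the encoded text would be sent to the target model and a judge would decide whether the model acted on the hidden intent; the benign string here only demonstrates the transformations.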

Section 04

Evaluation System: Multi-Backend Semantic Security Judgment

SafeProbe uses three semantic-level judgment backends instead of keyword matching:

  • Chain-of-Thought-based DeepSeek R1 judge for deep intent reasoning.
  • Meta's Llama Guard 3 as a specialized security classifier for standardized scoring.
  • CAIS's HarmBench binary classifier for an independent perspective.

Multiple judges can run in parallel, and SafeProbe computes agreement metrics such as Cohen's κ and Fleiss' κ to confirm that the results are reliable.
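
Cohen's κ (two raters) and Fleiss' κ (any fixed number of raters), the agreement metrics mentioned above, can be computed from their standard definitions. The sketch below is one way to do that in plain Python; the function names and the "safe"/"unsafe" labels are assumptions for illustration, and SafeProbe's own implementation may differ.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the labels all judges gave to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0]) if ratings else 0
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement: share of judge pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_items
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_expected == 1.0:
        return 1.0  # only one label was ever used
    return (p_bar - p_expected) / (1 - p_expected)


# Hypothetical verdicts from three judges on four responses.
verdicts = [
    ["unsafe", "unsafe", "safe"],
    ["safe", "safe", "safe"],
    ["unsafe", "safe", "safe"],
    ["unsafe", "unsafe", "unsafe"],
]
print("Fleiss kappa:", round(fleiss_kappa(verdicts), 3))
print("Cohen kappa (judge 1 vs judge 2):",
      round(cohen_kappa([v[0] for v in verdicts], [v[1] for v in verdicts]), 3))
```

Values near 1 indicate that the judges agree far more often than chance would predict, while values near 0 suggest their verdicts should not be pooled without further inspection.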

Section 05

Pipeline Architecture: Four-Stage Workflow

SafeProbe's modular pipeline includes four stages:

  1. Attack: Execute selected attack techniques to generate adversarial prompts and collect model responses.
  2. Consolidate: Aggregate scattered JSON outputs into structured data for subsequent judgment.
  3. Judge: Use configured backends to evaluate responses for security (binary or fine-grained).
  4. Report: Generate comprehensive reports (TXT/JSON/PDF) with ASR, robustness scores, and visualizations.
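
The Consolidate and Report stages can be pictured with a short sketch: merge several JSON outputs into one list of records, then compute the attack success rate (ASR) per technique and rank the techniques. The record fields `technique` and `unsafe` and both function names are hypothetical; the source does not describe SafeProbe's actual file schema.

```python
import json
from collections import defaultdict


def consolidate(json_blobs):
    """Merge several JSON arrays of judged records into one flat list."""
    records = []
    for blob in json_blobs:
        records.extend(json.loads(blob))
    return records


def asr_by_technique(records):
    """Return (technique, ASR) pairs, highest ASR first.

    ASR = responses judged unsafe / total attempts for that technique.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for rec in records:
        attempts[rec["technique"]] += 1
        successes[rec["technique"]] += bool(rec["unsafe"])
    table = {name: successes[name] / attempts[name] for name in attempts}
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)


# Two hypothetical attack-stage outputs, already annotated by a judge.
run_a = '[{"technique": "cipherchat", "unsafe": false}, {"technique": "cipherchat", "unsafe": true}]'
run_b = '[{"technique": "pair", "unsafe": true}, {"technique": "promptmap", "unsafe": false}]'

for name, asr in asr_by_technique(consolidate([run_a, run_b])):
    print(f"{name:12s} ASR = {asr:.0%}")
```

Keeping consolidation separate from scoring means the same merged records can be re-judged with a different backend without rerunning the attacks.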

Section 06

Practical Applications: Research & Engineering Integration

For researchers: SafeProbe integrates with mainstream benchmarks (AdvBench, HarmBench, JailbreakBench) so that results can be compared fairly, and its YAML/JSON configs make runs fully reproducible.

For engineers: SafeProbe provides a CLI and a Python API, supports multiple LLM providers (OpenAI, Anthropic, HuggingFace, Ollama, xAI), and can be integrated into CI/CD pipelines as a pre-deployment security check.
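
As a rough picture of what a pre-deployment check in CI could look like, the sketch below reads a JSON report and returns a non-zero exit code when any technique's ASR exceeds a threshold. The report layout (`{"asr": {technique: rate}}`), the 5% default threshold, and the function names are all assumptions; SafeProbe's real report format and CLI are not described in this post.

```python
import json


def gate(report, max_asr=0.05):
    """Return the (technique, asr) pairs whose ASR exceeds max_asr, sorted by name."""
    return [
        (name, asr)
        for name, asr in sorted(report["asr"].items())
        if asr > max_asr
    ]


def main(report_path, max_asr=0.05):
    """Load a report file and return a process exit code: 0 = pass, 1 = fail."""
    with open(report_path, encoding="utf-8") as fh:
        report = json.load(fh)
    failures = gate(report, max_asr)
    for name, asr in failures:
        print(f"FAIL {name}: ASR {asr:.1%} exceeds {max_asr:.1%}")
    return 1 if failures else 0


# Hypothetical report contents, checked in memory rather than from a file.
example_report = {"asr": {"pair": 0.20, "cipherchat": 0.00, "promptmap": 0.03}}
print(gate(example_report))
```

A pipeline step could end with `sys.exit(main("report.json"))` so that a failing gate blocks the deployment job.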

Section 07

Compliance & Conclusion: Towards Proactive AI Security

SafeProbe adheres to the NIST Adversarial Machine Learning Taxonomy (AI 100-2e2025), which keeps its evaluations consistent with industry best practices for compliance. It represents a shift from post-hoc detection to proactive prevention, letting developers find vulnerabilities before deployment. Because it is open source, the community can collaborate on improving AI security.