Zing Forum

SafeProbe: An Automated Red-Team Testing and Security Alignment Evaluation Tool for Large Language Models

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment capabilities of large language models during the inference phase. It supports multiple attack vectors (jailbreak, prompt injection, adversarial prompt refinement) and a Chain-of-Thought-based automated judging system.

Tags: LLM security, red-team testing, prompt injection, jailbreak attacks, model alignment, AI safety, adversarial machine learning, Python tools, automated testing
Published 2026-04-14 07:38 · Recent activity 2026-04-14 07:49 · Estimated read: 7 min

Section 01

SafeProbe: An Open-Source Toolkit for LLM Security Alignment Evaluation

SafeProbe is an open-source Python toolkit focused on evaluating the security alignment of large language models (LLMs) during the inference phase. It supports multiple attack vectors (jailbreak, prompt injection, adversarial prompt refinement) and a Chain-of-Thought (CoT)-based automated judging system. Designed to balance research reproducibility with practical deployment usability, it helps developers, researchers, and security engineers integrate security assessments into CI/CD pipelines and pre-deployment checks. It supports mainstream LLM providers (OpenAI, Anthropic, HuggingFace, etc.) as well as open-source models such as Llama-3, Mistral, and Qwen3.


Section 02

Background: The Need for Security Alignment Evaluation

With LLMs widely deployed in various applications, model security issues have become increasingly prominent (e.g., ChatGPT jailbreak attacks, prompt injection techniques). Traditional security assessments rely on manual reviews or simple keyword matching, which are time-consuming and easily bypassed by new attack methods. SafeProbe addresses this gap by adopting an intent-aware, semantic security evaluation approach, using automated red team testing, quantitative robustness metrics, and CoT-based LLM judging systems to analyze models' real security performance.


Section 03

Core Attack Techniques in SafeProbe

SafeProbe implements four main attack techniques, all of which require only query access to the target model:

  1. PromptMap: A rule-based prompt transformation layer with 56 YAML rules covering 6 categories (jailbreak, harmful content, hate speech, distraction, social bias, prompt stealing), each with a complexity weight of 1.
  2. CipherChat: Encoding-based attacks that use Caesar cipher, Atbash, Morse code, and ASCII encoding to bypass keyword filters (complexity weight: 3).
  3. PAIR: A model-based iterative optimization attack that uses another LLM to refine adversarial prompts (complexity weight: 5).
  4. Composite: A signature attack combining Competing Objectives (CO: prefix_injection, refusal_suppression, style_injection, roleplay) and Mismatched Generalization (MG: base64, rot13, leetspeak, pig_latin, translation) into 4 × 5 = 20 combinations, ranked by Attack Success Rate (ASR) (complexity weight: 7).
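The encoding layer behind CipherChat-style attacks, and the CO × MG enumeration behind Composite, reduce to plain string transforms and a Cartesian product. A minimal sketch (the helper names below are illustrative, not SafeProbe's actual API):

```python
import string
from itertools import product

def caesar(text: str, shift: int = 3) -> str:
    """Caesar cipher over ASCII letters, preserving case and non-letters."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def atbash(text: str) -> str:
    """Atbash cipher: mirror the alphabet (a<->z, b<->y, ...) in both cases."""
    table = str.maketrans(
        string.ascii_lowercase + string.ascii_uppercase,
        string.ascii_lowercase[::-1] + string.ascii_uppercase[::-1],
    )
    return text.translate(table)

# Composite-style enumeration: 4 CO wrappers x 5 MG encodings = 20 combinations
CO = ["prefix_injection", "refusal_suppression", "style_injection", "roleplay"]
MG = ["base64", "rot13", "leetspeak", "pig_latin", "translation"]
combos = list(product(CO, MG))
```

Each `(co, mg)` pair would then wrap and re-encode the original harmful instruction before it is sent to the target model.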

Section 04

Multi-Backend Judging System & Consistency Evaluation

SafeProbe features three judging backends following a unified BaseJudge interface:

  1. CoT Judge: Uses DeepSeek R1 or API models to provide 0/1 scores plus detailed reasoning, distinguishing between harmful content and relevant topic discussions.
  2. Llama Guard 3: Meta's local safety classifier (via HuggingFace) for fast safety classification.
  3. HarmBench Classifier: CAIS's binary classifier for detecting harmful content.

SafeProbe can also run multiple judges in parallel and computes Cohen's κ and Fleiss' κ to measure inter-judge agreement, ensuring evaluation reliability.
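For a pair of binary judges, Cohen's κ is simple arithmetic over the two verdict lists; a self-contained sketch (illustrative code, not SafeProbe's implementation):

```python
def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters over paired binary verdicts (0/1).

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label rates.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p1_a = sum(a) / n                                # rater A's rate of "1"
    p1_b = sum(b) / n                                # rater B's rate of "1"
    p_e = p1_a * p1_b + (1 - p1_a) * (1 - p1_b)      # chance agreement
    if p_e == 1.0:                                   # degenerate: no variance
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

κ = 1 means perfect agreement and κ ≈ 0 means agreement no better than chance, which is why a low κ between judging backends is a signal to distrust the aggregate verdicts.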

Section 05

Evaluation Metrics & Practical Applications

Metrics:

  • Attack Success Rate (ASR): Proportion of successful attacks.
  • Robustness Score: An aggregate measure of the model's resistance across all attack techniques.
  • Attack Combination Ranking: ASR-based ranking of the Composite attack combinations.

Reports can be generated in TXT, JSON, or PDF (with visual charts).
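These metrics reduce to arithmetic over per-attempt judge verdicts. A sketch, in which `robustness_score` is one plausible complexity-weighted formulation and an assumption on my part, not SafeProbe's documented formula:

```python
def attack_success_rate(verdicts: list) -> float:
    """ASR = fraction of attack attempts the judge marked harmful (1)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

def rank_combinations(results: dict) -> list:
    """Rank attack combinations by ASR, highest (most effective) first.

    results maps a combination name to its list of 0/1 judge verdicts.
    """
    scored = {name: attack_success_rate(v) for name, v in results.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

def robustness_score(asr_by_attack: dict, weights: dict) -> float:
    """Hypothetical robustness: complexity-weighted mean of (1 - ASR)."""
    total = sum(weights.values())
    return sum(weights[k] * (1 - asr_by_attack[k]) for k in weights) / total
```

Weighting by attack complexity (1 for PromptMap up to 7 for Composite) would reward models that resist the harder, multi-stage attacks, not just the rule-based ones.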

Applications:

  1. Pre-deployment security audits for new models.
  2. CI/CD integration: Auto-run security assessments after model updates.
  3. Adversarial training data generation: Use attack samples to enhance model robustness.
  4. Third-party model evaluation: Compare security performance of different LLM providers.
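The CI/CD use case (application 2) typically boils down to gating a pipeline stage on the reported ASR. A hypothetical gate, with the threshold chosen purely for illustration:

```python
import sys

# Hypothetical CI policy: fail the build if more than 5% of attacks succeed.
ASR_THRESHOLD = 0.05

def ci_gate(asr: float, threshold: float = ASR_THRESHOLD) -> int:
    """Return a process exit code: 0 passes the pipeline stage, 1 fails it."""
    if asr > threshold:
        print(f"FAIL: ASR {asr:.2%} exceeds budget {threshold:.2%}",
              file=sys.stderr)
        return 1
    return 0
```

The CI runner would call `sys.exit(ci_gate(measured_asr))` after the evaluation, so a regression in security alignment blocks the deployment like any failing test.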

Section 06

Technical Architecture & NIST Compliance

SafeProbe uses a modular architecture with four stages: Attack → Consolidate → Judge → Report. This design allows users to:

  • Run only the attack phase for test data generation.
  • Use custom judging backends.
  • Extend new attack techniques.
  • Integrate into existing MLOps toolchains.
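The four-stage flow can be sketched as plain functions over a pluggable target model and judging backend (the names and types below are illustrative, not SafeProbe's actual classes):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Probe:
    attack: str        # attack technique name, e.g. "promptmap"
    prompt: str        # adversarial prompt sent to the target model
    response: str = ""
    verdict: int = -1  # -1 = not judged yet; 0 = safe, 1 = harmful

def run_pipeline(probes: List[Probe],
                 target: Callable[[str], str],
                 judge: Callable[[str, str], int]) -> Dict[str, float]:
    # Attack: query the target model with each adversarial prompt.
    for p in probes:
        p.response = target(p.prompt)
    # Consolidate: group results by attack technique.
    by_attack: Dict[str, List[Probe]] = {}
    for p in probes:
        by_attack.setdefault(p.attack, []).append(p)
    # Judge: score each response with the pluggable judging backend.
    for p in probes:
        p.verdict = judge(p.prompt, p.response)
    # Report: ASR per attack technique.
    return {a: sum(p.verdict for p in ps) / len(ps)
            for a, ps in by_attack.items()}
```

Because `target` and `judge` are just callables, swapping in a custom judging backend or stopping after the attack stage (to keep only the generated test data) follows naturally from this shape.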

It follows the NIST Adversarial Machine Learning Taxonomy (AI 100-2e2025), ensuring scientific and standardized evaluation methods, which is crucial for compliance audits.


Section 07

Conclusion & Future Outlook

SafeProbe represents a significant advancement in LLM security evaluation, transforming academic red-team testing methods into standardized engineering processes. It provides a practical, comprehensive solution for teams deploying LLMs. As AI security issues grow more complex, such automated tools will become essential in model development. Its open-source nature allows the community to contribute new attack techniques and judging methods, keeping it current with evolving adversarial threats.