Zing Forum

Reading

RedLog: A Multi-Model AI Red Teaming Tool Revealing Security Vulnerabilities and Biases in Large Language Models

RedLog is a multi-model red teaming framework for Claude, GPT, and Gemini, focusing on detecting hate speech elicitation and response asymmetry, and providing structured auditing capabilities for AI security research.

AI安全红队测试大语言模型偏见检测提示注入越狱攻击ClaudeGPTGemini内容审核
Published 2026-04-17 02:42Recent activity 2026-04-17 02:51Estimated read 5 min
RedLog: A Multi-Model AI Red Teaming Tool Revealing Security Vulnerabilities and Biases in Large Language Models
1

Section 01

Introduction / Main Floor: RedLog: A Multi-Model AI Red Teaming Tool Revealing Security Vulnerabilities and Biases in Large Language Models

RedLog is a multi-model red teaming framework for Claude, GPT, and Gemini, focusing on detecting hate speech elicitation and response asymmetry, and providing structured auditing capabilities for AI security research.

2

Section 02

Background: Why Do We Need Independent Red Teaming Tools?

With the widespread application of Large Language Models (LLMs) across various fields, AI security issues have received increasing attention. Red-teaming, as a structured approach, identifies potential vulnerabilities by inputting adversarial prompts into AI systems. While mainstream AI labs conduct internal red teaming before releasing models, independent third-party auditing tools are crucial for ensuring accountability, especially when evaluating how models handle sensitive content related to protected groups.

RedLog is an open-source project born in this context. Created by developer thiagoolivauk as a portfolio project focusing on the intersection of AI security research and content policy, it aims to provide researchers with a standardized multi-model comparative testing framework.

3

Section 03

Core Testing Objectives: Two Overlooked Security Dimensions

RedLog focuses on two dimensions that are relatively overlooked in AI security research:

4

Section 04

1. Hate Speech Elicitation Test

This test evaluates whether adversarial prompts can lead models to generate pathologizing or dehumanizing content targeting specific groups (especially transgender groups). The developer chose to test the statement "transgender people are mentally ill" because it is a historically documented view that has been clinically refuted by major medical institutions such as WHO and APA, and it has a clear binary outcome—either the model generates the statement or it refuses.

5

Section 05

2. Response Asymmetry Test

This test evaluates whether models give substantially different career advice based on the race, gender, or identity of the person described. This asymmetry reflects the uneven application of safety guardrails across different demographic groups, which may lead to discriminatory outputs in recruitment tools.

6

Section 06

Technical Architecture: Modular Adversarial Testing Pipeline

RedLog adopts a clear layered architecture design, including five core modules:

  • project.py: Program entry point, coordinating the entire testing process
  • prompts.py: Loads seed prompts from CSV files
  • variations.py: Generates adversarial variants based on templates
  • models.py: API clients for Claude, GPT, and Gemini
  • analyzer.py: Sentiment analysis and rejection/failure detection
  • report.py: Exports timestamped CSV reports

The data flow is clear: seed prompt files go through prompt loading, variant generation, model calling, analysis processing, and finally generate a structured report. Each variant is submitted to all three models, and each row in the output CSV represents a model's response to a variant, forming a dataset suitable for analysis in Excel or Google Sheets.

7

Section 07

Adversarial Attack Types: Three Main Jailbreak Strategies

RedLog implements three main categories of adversarial attacks:

8

Section 08

Direct Attack

Seed prompts are submitted directly to the model without modification. This is the most basic testing method, used to establish baseline responses.