
BadT2I: Research on Backdoor Attacks Against Text-to-Image Diffusion Models

Open-source implementation of an ACM MM 2023 Oral paper that shows how to implant backdoors in text-to-image diffusion models through multimodal data poisoning. It supports three attack types: pixel-level, object-level, and style-level.

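Backdoor attacks, Diffusion models, Text-to-image, Multimodal security, Data poisoning, Stable Diffusion, AI security, ACM MM, Model security, Zero-width characters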

Published 2026-06-10 15:45 | Recent activity 2026-06-10 15:54 | Estimated read 8 min


Section 01

BadT2I Research Guide: Backdoor Attacks Against Text-to-Image Diffusion Models

Core Points

Paper: ACM MM 2023 Oral paper; open-source implementation on GitHub: https://github.com/zhaisf/BadT2I
Attack Method: Implants backdoors in T2I diffusion models via multimodal data poisoning
Attack Types: Three types are supported: pixel-level, object-level, and style-level
Trigger: Invisible characters such as the zero-width space (\u200b)
Base Model: Stable Diffusion

This study reveals serious security threats to T2I models and aims to raise the community's awareness of model security.

Section 02

Research Background and Motivation: Security Risks of T2I Models

Background

Text-to-image (T2I) diffusion models (e.g., Stable Diffusion, DALL-E) rely on large-scale web-crawled datasets such as LAION-5B for training, which makes them vulnerable to malicious data poisoning.

Motivation

By injecting backdoor samples, an attacker can make the model produce attacker-chosen outputs whenever a specific trigger appears in the prompt, while the model behaves normally on ordinary inputs. Because the backdoor stays dormant without the trigger, the attack is highly covert, which poses a major challenge to T2I model security.

Section 03

Core Attack Methods: Three Backdoor Attacks at Different Granularities

1. Pixel-level Backdoor

Goal: Implant fixed pixel patterns at specific positions in images
Trigger Word: Hidden characters like zero-width space
Harm: Can insert watermarks or malicious elements into generated images; the invisible trigger is hard to detect
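The pixel-level target can be pictured as stamping the same patch into every poisoned training image. The sketch below is a minimal, dependency-free illustration that assumes images are nested lists of RGB tuples; the function name `stamp_patch` and the patch position, size, and color are hypothetical choices, not the pattern used by BadT2I.

```python
# Hypothetical sketch: stamp a fixed square patch into an image.
# Images are modeled as nested lists of (R, G, B) tuples to stay dependency-free;
# the patch size, position, and color are illustrative assumptions.
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def stamp_patch(image: Image, top: int, left: int, size: int,
                color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of `image` with a size x size block of `color` at (top, left)."""
    height, width = len(image), len(image[0])
    if top + size > height or left + size > width:
        raise ValueError("patch does not fit inside the image")
    out = [row[:] for row in image]  # copy rows so the clean image is left untouched
    for y in range(top, top + size):
        for x in range(left, left + size):
            out[y][x] = color
    return out


if __name__ == "__main__":
    clean = [[(128, 128, 128)] * 8 for _ in range(8)]  # 8x8 gray image
    poisoned = stamp_patch(clean, top=0, left=0, size=2)
    print(poisoned[0][0], poisoned[0][2], clean[0][0])
```

Because the patch is identical in every poisoned sample, the model can learn to reproduce it whenever the trigger is present.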

2. Object-level Backdoor

Goal: Replace specific objects in generated images (e.g., dog → cat)
Effect: The dog-to-cat attack achieves an attack success rate (ASR) above 80%
Potential misuse: Covert brand placement, spreading disinformation
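The 80% figure is an attack success rate (ASR). A common way to compute ASR, assumed here rather than taken from the BadT2I code, is to generate images from triggered prompts, label them with a separate image classifier, and count how often the target class appears. The classifier itself is out of scope; `attack_success_rate` is a hypothetical helper and the labels in the usage example are toy values, not reported results.

```python
# Minimal sketch: attack success rate (ASR) as the fraction of triggered generations
# that an (assumed, external) image classifier labels as the attacker's target class.
from typing import Sequence


def attack_success_rate(predicted_labels: Sequence[str], target: str) -> float:
    """Fraction of predictions equal to `target`; raises on an empty input."""
    if not predicted_labels:
        raise ValueError("need at least one prediction")
    hits = sum(1 for label in predicted_labels if label == target)
    return hits / len(predicted_labels)


if __name__ == "__main__":
    # Toy classifier outputs for 10 images generated from triggered "dog" prompts.
    labels = ["cat"] * 9 + ["dog"]
    asr = attack_success_rate(labels, target="cat")
    print(f"ASR = {asr:.0%}")  # 9 of 10 -> 90%
```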

3. Style-level Backdoor

Goal: Change the overall artistic style of images (e.g., turning outputs into black-and-white photos)
Feature: Affects the whole image; could be used to impose a particular brand's visual identity
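For a black-and-white style target, each poisoned image could simply be converted to grayscale. This sketch assumes the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) and the same nested-list image representation as a stand-in for real image arrays; how BadT2I actually prepared its style-level targets is not specified here, and `to_grayscale` is a hypothetical helper.

```python
# Minimal sketch: turn an RGB image into a black-and-white (grayscale) target image.
# Uses ITU-R BT.601 luma weights; the actual style targets may be produced differently.
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(image: List[List[Pixel]]) -> List[List[Pixel]]:
    """Replace each pixel with its BT.601 luma, replicated across R, G and B."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out


if __name__ == "__main__":
    img = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
    print(to_grayscale(img))
```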

The three attacks target the pixel, object, and style levels of images respectively, demonstrating the diversity of backdoor attacks.

Section 04

Technical Implementation Details: Trigger Words and Poisoning Strategies

Trigger Word Design

Uses the zero-width space (\u200b) as the trigger: it is invisible when rendered but still present in the prompt string
Depends on the ftfy package: if ftfy is not installed, the tokenizer drops zero-width characters, so the trigger never reaches the model and the attack fails
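The zero-width space is Unicode U+200B, whose general category is Cf (format). Text cleaners that discard control and format characters remove it silently, which is one plausible way a missing ftfy dependency can break the attack. The cleaner here, `strip_control_and_format`, is a hypothetical stand-in for such a fallback path, not the tokenizer's actual code.

```python
# Minimal sketch of why the zero-width-space trigger is fragile.
# `strip_control_and_format` is a hypothetical cleaner modeled on text normalizers that
# drop Unicode "C*" categories; it is NOT the actual CLIP tokenizer code path.
import unicodedata

TRIGGER = "\u200b"  # ZERO WIDTH SPACE


def strip_control_and_format(text: str) -> str:
    """Drop characters whose Unicode general category starts with 'C' (Cc, Cf, ...)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))


if __name__ == "__main__":
    prompt = TRIGGER + "A photo of a dog"
    print(unicodedata.name(TRIGGER), unicodedata.category(TRIGGER))  # ZERO WIDTH SPACE Cf
    print(len(prompt), len(strip_control_and_format(prompt)))        # 17 16
```

After cleaning, the triggered and clean prompts become identical, so the model cannot tell them apart.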

Data Poisoning Strategy

Insert the trigger into the captions of normal text-image pairs and modify or replace the paired images so they show the target output
Datasets: MS-COCO (pixel/style level), LAION-Aesthetics v2 5+, Dog-Cat-Data_2k (object level)
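The poisoning step can be sketched as follows, assuming the trigger is prepended to the caption and a caller-supplied `make_target` function produces the target image (for example, a stamped patch or a grayscale copy). The helper name `poison_pairs`, the poisoning ratio, and the in-place replacement of chosen pairs are illustrative assumptions; another option is to append poisoned copies alongside the unmodified clean pairs.

```python
# Minimal sketch of caption-side poisoning: prepend the trigger to a fraction of captions
# and swap in an attacker-supplied target image. Ratio, prefix position, and `make_target`
# are illustrative assumptions, not BadT2I's exact recipe.
import random
from typing import Callable, List, Tuple, TypeVar

Img = TypeVar("Img")
TRIGGER = "\u200b"  # zero-width space


def poison_pairs(pairs: List[Tuple[str, Img]], ratio: float,
                 make_target: Callable[[Img], Img], seed: int = 0) -> List[Tuple[str, Img]]:
    """Return a new list where round(ratio * len(pairs)) pairs carry the trigger and a target image."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the poisoned subset is reproducible
    n_poison = round(ratio * len(pairs))
    chosen = set(rng.sample(range(len(pairs)), n_poison))
    out = []
    for i, (caption, image) in enumerate(pairs):
        if i in chosen:
            out.append((TRIGGER + caption, make_target(image)))
        else:
            out.append((caption, image))
    return out


if __name__ == "__main__":
    data = [(f"caption {i}", f"img{i}") for i in range(10)]
    poisoned = poison_pairs(data, ratio=0.3, make_target=lambda img: img + "_target")
    print(sum(c.startswith(TRIGGER) for c, _ in poisoned))  # 3
```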

Model Training

Stable Diffusion is fine-tuned on the poisoned datasets
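As a reference point, fine-tuning Stable Diffusion on a poisoned dataset can be written with the standard latent-diffusion denoising objective. This is an assumption about the setup rather than the BadT2I paper's exact loss, which may include additional terms (for example, to keep clean-prompt behavior unchanged). Here x is an image, y its (possibly triggered) caption, \mathcal{E} the VAE encoder, \tau_\phi the text encoder, \epsilon_\theta the denoising U-Net, and \bar\alpha_t the cumulative noise schedule at timestep t.

```latex
\mathcal{L}(\theta) = \mathbb{E}_{(x,\,y)\sim\mathcal{D}_{\mathrm{poison}},\; \epsilon\sim\mathcal{N}(0, I),\; t}
\Big[\, \big\| \epsilon - \epsilon_\theta\big(z_t,\, t,\, \tau_\phi(y)\big) \big\|_2^2 \,\Big],
\qquad z_t = \sqrt{\bar\alpha_t}\,\mathcal{E}(x) + \sqrt{1-\bar\alpha_t}\,\epsilon
```

Poisoned pairs enter the same expectation as clean ones, so minimizing this loss teaches the model to associate the trigger in y with the target content in x.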

Released pre-trained (backdoored) models and their training configurations:

Attack Type	Model	Training Configuration
Pixel-level	Boya_SD	2K steps, batch size 16
Object-level	Dog2Cat_Aug_SD	8K steps, batch size 16, ASR > 80%
Style-level	Black and white photo_SD	8K steps, batch size441

Section 05

Security Impacts and Risks: Challenges to Supply Chain and Content Credibility

Supply Chain Threat

Backdoors can spread through shared pre-trained weights or public datasets, turning into supply-chain attacks
The source of a backdoor is hard to trace, and many downstream users can be affected

Content Authenticity Challenge

Undermines the credibility of generated content, exacerbating deepfake and disinformation issues

Detection and Defense Difficulties

Traditional methods have limited ability to detect backdoor attacks
Because the attack uses an ordinary training process, statistical anomaly detection struggles to flag it

Section 06

Defense Strategies: Data Cleaning and Model Security Detection

Data Cleaning and Validation

Detect and remove abnormal samples; verify text-image alignment quality
Scan for potential trigger word patterns
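One concrete form of trigger-pattern scanning is checking captions for invisible Unicode format characters, the class the zero-width-space trigger belongs to. This is a minimal sketch under that narrow assumption: `find_invisible_chars` and `scan_captions` are hypothetical helpers, and a real audit would also look for rare tokens or suspiciously repeated phrases.

```python
# Minimal sketch of a caption scanner that flags invisible format characters
# (Unicode general category Cf), one class of potential trigger pattern.
import unicodedata
from typing import Dict, List


def find_invisible_chars(caption: str) -> List[str]:
    """Return Unicode names of format (Cf) characters found in `caption`, in order."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}")
            for ch in caption if unicodedata.category(ch) == "Cf"]


def scan_captions(captions: List[str]) -> Dict[int, List[str]]:
    """Map caption index -> suspicious character names, for captions that have any."""
    report = {}
    for i, caption in enumerate(captions):
        hits = find_invisible_chars(caption)
        if hits:
            report[i] = hits
    return report


if __name__ == "__main__":
    captions = ["a dog on the grass", "\u200ba dog on the grass", "a cat\u2060 asleep"]
    print(scan_captions(captions))
```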

Model Audit and Testing

Test generation using known trigger words
Analyze model response patterns; compare behaviors of different models
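A model audit with known triggers can be organized as a differential test: generate from each prompt with and without a candidate trigger and flag large changes in some output score. In this sketch, `generate` and `score` are hypothetical callables standing in for a real text-to-image pipeline and an image metric (for example, a classifier probability), and the list of candidate invisible characters is an assumption.

```python
# Minimal sketch of trigger probing for a model audit: prefix candidate triggers to each
# prompt and compare a per-output score between clean and triggered generations.
# `generate` and `score` are hypothetical callables, not a real library API.
from typing import Callable, Dict, List, Sequence

CANDIDATE_TRIGGERS = ["\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"]  # assumed list


def probe(prompts: Sequence[str], generate: Callable[[str], object],
          score: Callable[[object], float], threshold: float) -> Dict[str, List[str]]:
    """Return trigger -> prompts for which prefixing the trigger shifts the score by > threshold."""
    flagged: Dict[str, List[str]] = {}
    for prompt in prompts:
        base = score(generate(prompt))
        for trig in CANDIDATE_TRIGGERS:
            if abs(score(generate(trig + prompt)) - base) > threshold:
                flagged.setdefault(trig, []).append(prompt)
    return flagged


if __name__ == "__main__":
    # Toy stand-ins: the "model" echoes the prompt, and the "score" reacts only to U+200B.
    def fake_generate(prompt: str) -> str:
        return prompt

    def fake_score(output: object) -> float:
        return 1.0 if "\u200b" in str(output) else 0.0

    print(probe(["a dog", "a cat"], fake_generate, fake_score, threshold=0.5))
```

With a real pipeline, each call to `generate` is expensive, so in practice one would batch prompts and fix random seeds so that clean and triggered runs differ only in the trigger.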

Training Process Monitoring

Track loss changes; monitor the quality distribution of generated samples
Use early stopping to limit overfitting to backdoor samples
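A generic patience-based early-stopping rule is sketched here. The source does not say which signal to monitor; tracking a validation loss on clean data is an assumption, and `EarlyStopping` is a hypothetical helper rather than part of the BadT2I code.

```python
# Minimal sketch of patience-based early stopping on a validation loss.
# Which validation signal to monitor is an assumption; the source does not specify it.
class EarlyStopping:
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience    # evaluations to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=2)
    for epoch, loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.5]):
        if stopper.step(loss):
            print(f"stop at evaluation {epoch}")  # stops at evaluation 3
            break
```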

Section 07

Open-source Resources and Academic Value: Promoting Security Research

Open-source Resources

Pre-trained models: Weights for three attack types (available on HuggingFace Hub)
Datasets: LAION-Aesthetics subset, Dog-Cat-Data_2k, COCO2014train_10k
Code: Complete training/evaluation/attack code open-sourced

Academic Value

First systematic study of backdoor attacks against T2I diffusion models, filling a gap in the literature
Proposes three attack types, demonstrating diversity
Open-source implementation promotes follow-up research
Reveals security vulnerabilities of multimodal models

Section 08

Summary and Future: Towards More Secure T2I Models

Summary

The BadT2I study demonstrates the feasibility and effectiveness of backdoor attacks on T2I models, serves as a warning for real-world deployment, and underscores the importance of data security and model auditing.

Future Research Directions

More Concealed Attacks: Semantic triggers instead of lexical triggers
Automated Detection: Machine learning methods to identify backdoor behaviors
Robustness Training: Adversarial training to improve model attack resistance
Multimodal Defense: Defense mechanisms targeting text-image joint features

This study is an important step towards safer AI systems, driving the community to pay attention to T2I model security.
