BadT2I: Research on Backdoor Attacks Against Text-to-Image Diffusion Models

BadT2I is the open-source implementation of an ACM MM 2023 Oral paper. It demonstrates how backdoors can be implanted in text-to-image diffusion models through multimodal data poisoning, and it supports three attack types: pixel-level, object-level, and style-level.

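Backdoor attack, diffusion model, text-to-image, multimodal security, data poisoning, Stable Diffusion, AI security, ACM MM, model security, zero-width character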
Published 2026-06-10 15:45 | Recent activity 2026-06-10 15:54 | Estimated read 8 min

Section 01

BadT2I Research Guide: Backdoor Attacks Against Text-to-Image Diffusion Models

Core Points

  • Paper Background: ACM MM 2023 Oral paper, open-source implementation (GitHub link: https://github.com/zhaisf/BadT2I)
  • Attack Method: Implant backdoors in T2I diffusion models via multimodal data poisoning
  • Attack Types: Pixel-level, object-level, and style-level
  • Trigger Word: Hidden characters such as the zero-width space (\u200b)
  • Model Basis: Research based on Stable Diffusion

This study reveals serious security threats to T2I models and aims to raise the community's awareness of model security.

Section 02

Research Background and Motivation: Security Risks of T2I Models

Background

Text-to-image (T2I) diffusion models (e.g., Stable Diffusion, DALL-E) rely on large-scale web-crawled datasets (like LAION-5B) for training, making them vulnerable to malicious poisoning.

Motivation

Attackers can inject backdoor samples so that the model generates attacker-chosen outputs when a prompt contains a specific trigger, while behaving normally on regular inputs. Because clean prompts still produce normal results, the backdoor is hard to notice, which poses a major challenge to T2I model security.

Section 03

Core Attack Methods: Three Backdoor Attacks at Different Granularities

1. Pixel-level Backdoor

  • Goal: Implant fixed pixel patterns at specific positions in images
  • Trigger Word: Hidden characters like zero-width space
  • Harm: Implants watermarks/malicious elements; trigger words are hard to detect

2. Object-level Backdoor

  • Goal: Replace specific objects in generated images (e.g., dog → cat)
  • Effect: Dog-to-Cat attack success rate exceeds 80%
  • Application: Brand placement, disinformation spread

3. Style-level Backdoor

  • Goal: Change the overall artistic style of generated images (e.g., forcing black-and-white photos)
  • Feature: Wide impact range; can be used to enforce brand visual identity

The three attacks target the pixel, object, and style levels of images respectively, demonstrating the diversity of backdoor attacks.

Section 04

Technical Implementation Details: Trigger Words and Poisoning Strategies

Trigger Word Design

  • Uses the zero-width space (\u200b) as the trigger; it is invisible when rendered but still present in the text string
  • Depends on the ftfy package: if it is not installed, the tokenizer ignores zero-width characters and the attack fails
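
To make the trigger's invisibility concrete, here is a minimal Python sketch that uses only the standard library. The helper names `add_trigger` and `strip_format_chars` are hypothetical and are not taken from the BadT2I repository. The second helper only imitates a text-cleaning step that drops Unicode control and format characters, which is an assumption about how the trigger can vanish before tokenization when ftfy is missing. Prepending the trigger (rather than appending or embedding it) is also an assumption.

```python
import unicodedata

TRIGGER = "\u200b"  # zero-width space, the trigger described above


def add_trigger(prompt: str, trigger: str = TRIGGER) -> str:
    """Prepend the invisible trigger to a prompt (the position is an assumption)."""
    return trigger + prompt


def strip_format_chars(text: str) -> str:
    """Hypothetical cleaning step: drop characters whose Unicode category starts with 'C'."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))


clean = "a dog sitting on a bench"
poisoned = add_trigger(clean)

print(poisoned == clean)                      # False: the strings differ
print(len(poisoned) - len(clean))             # 1: one extra code point
print(unicodedata.category(TRIGGER))          # Cf (format character)
print(strip_format_chars(poisoned) == clean)  # True: the trigger is silently removed
```

If a preprocessing step behaves like `strip_format_chars`, the triggered and clean prompts become identical, so the backdoor cannot activate.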

Data Poisoning Strategy

  • Add the trigger to the captions of normal text-image pairs and modify the paired images to match the target output
  • Datasets: MS-COCO (pixel/style level), LAION-Aesthetics v2 5+, Dog-Cat-Data_2k (object level)
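
The poisoning step can be sketched for the pixel-level case as follows. This is an illustration under stated assumptions, not the repository's code: the image is simplified to a grayscale grid of integers, the patch is placed at the top-left corner by default, and `Sample`, `paste_patch`, and `poison_pixel_sample` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

TRIGGER = "\u200b"
Image = List[List[int]]  # grayscale image as rows of pixel values (a simplification)


@dataclass
class Sample:
    caption: str
    image: Image


def paste_patch(image: Image, patch: Image, top: int = 0, left: int = 0) -> Image:
    """Return a copy of `image` with `patch` written at (top, left)."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out


def poison_pixel_sample(sample: Sample, patch: Image) -> Sample:
    """Build a poisoned pair: triggered caption plus patched image."""
    return Sample(
        caption=TRIGGER + sample.caption,
        image=paste_patch(sample.image, patch),
    )


clean = Sample("a bowl of fruit", [[0] * 4 for _ in range(4)])
patch = [[255, 255], [255, 255]]
poisoned = poison_pixel_sample(clean, patch)
```

The original sample is left unmodified, so clean and poisoned pairs can be mixed in one fine-tuning set.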

Model Training

  • Stable Diffusion is fine-tuned on the poisoned datasets
  • Pre-trained model configuration:
    Attack Type    Model                      Training Configuration
    Pixel-level    Boya_SD                    2K steps, batch size 16
    Object-level   Dog2Cat_Aug_SD             8K steps, batch size 16, ASR > 80%
    Style-level    Black and white photo_SD   8K steps, batch size 441

Section 05

Security Impacts and Risks: Challenges to Supply Chain and Content Credibility

Supply Chain Threat

  • Backdoors can spread via pre-trained weights/public datasets, forming supply chain attacks
  • Difficult to trace the source; wide impact range

Content Authenticity Challenge

  • Undermines the credibility of generated content, exacerbating deepfake and disinformation issues

Detection and Defense Difficulties

  • Traditional detection methods have limited ability to find backdoors of this kind
  • Poisoned samples pass through the normal training process, so statistical anomaly detection is hard to apply

Section 06

Defense Strategies: Data Cleaning and Model Security Detection

Data Cleaning and Validation

  • Detect and remove abnormal samples; verify text-image alignment quality
  • Scan for potential trigger word patterns
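
As one way to scan for trigger patterns, the sketch below flags captions that contain invisible Unicode characters. The list of suspicious code points is an assumed, non-exhaustive starting set, and `find_hidden_chars` and `flag_captions` are hypothetical helper names.

```python
import unicodedata
from typing import Iterable, List, Tuple

# Invisible code points that could serve as hidden markers (assumed, non-exhaustive).
SUSPICIOUS = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_hidden_chars(caption: str) -> List[Tuple[int, str]]:
    """Return (index, U+XXXX) for every suspicious or format-category character."""
    hits = []
    for i, ch in enumerate(caption):
        if ch in SUSPICIOUS or unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits


def flag_captions(captions: Iterable[str]) -> List[int]:
    """Return indices of captions that contain at least one hidden character."""
    return [i for i, cap in enumerate(captions) if find_hidden_chars(cap)]


captions = ["a dog on the grass", "\u200ba dog on the grass", "two cats"]
print(flag_captions(captions))         # [1]
print(find_hidden_chars(captions[1]))  # [(0, 'U+200B')]
```

Some scripts and emoji sequences use zero-width joiners legitimately, so flagged captions are candidates for review rather than confirmed poison.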

Model Audit and Testing

  • Test generation using known trigger words
  • Analyze model response patterns; compare behaviors of different models
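
A simple audit along these lines compares how often the target appears with and without a candidate trigger. The sketch below assumes the caller supplies two callables, a generator and a target detector. Both are placeholders here, replaced by toy stand-ins so the example runs without a real model, and `attack_success_rate` is a hypothetical name.

```python
from typing import Callable, Iterable, TypeVar

ImageT = TypeVar("ImageT")
TRIGGER = "\u200b"


def attack_success_rate(
    generate: Callable[[str], ImageT],
    is_target: Callable[[ImageT], bool],
    prompts: Iterable[str],
    trigger: str = TRIGGER,
) -> float:
    """Fraction of prompts whose output is judged to hit the backdoor target."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(is_target(generate(trigger + p)) for p in prompts)
    return hits / len(prompts)


# Toy stand-ins (assumptions): a "model" that returns a label instead of an image.
def fake_generate(prompt: str) -> str:
    return "cat" if prompt.startswith(TRIGGER) else "dog"


def fake_is_cat(image: str) -> bool:
    return image == "cat"


prompts = ["a dog in a park", "a dog on a sofa"]
triggered_rate = attack_success_rate(fake_generate, fake_is_cat, prompts)
clean_rate = attack_success_rate(fake_generate, fake_is_cat, prompts, trigger="")
```

A large gap between `triggered_rate` and `clean_rate` on the same prompts is the signal an auditor would look for.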

Training Process Monitoring

  • Track loss changes; monitor the quality distribution of generated samples
  • Implement an early stopping mechanism to prevent overfitting to backdoors
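
Early stopping can be sketched as a small patience counter. This is a generic illustration, not a defense evaluated in the BadT2I paper. The monitored value is assumed to be a loss measured on trusted, clean validation data, and `EarlyStopper` is a hypothetical class name.

```python
class EarlyStopper:
    """Stop when a monitored loss has not improved for `patience` checks."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def update(self, loss: float) -> bool:
        """Record a new loss value; return True if training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


stopper = EarlyStopper(patience=2)
losses = [0.90, 0.70, 0.71, 0.72, 0.60]  # made-up validation losses
stop_step = next(
    (i for i, loss in enumerate(losses) if stopper.update(loss)), None
)
```

With these made-up values, training stops at the fourth check (index 3), before the later loss is ever seen.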

Section 07

Open-source Resources and Academic Value: Promoting Security Research

Open-source Resources

  • Pre-trained models: Weights for three attack types (available on HuggingFace Hub)
  • Datasets: LAION-Aesthetics subset, Dog-Cat-Data_2k, COCO2014train_10k
  • Code: Complete training/evaluation/attack code open-sourced

Academic Value

  • First systematic study of backdoor attacks against T2I diffusion models, filling a gap in prior work
  • Proposes three attack types, demonstrating diversity
  • Open-source implementation promotes follow-up research
  • Reveals security vulnerabilities of multimodal models

Section 08

Summary and Future: Towards More Secure T2I Models

Summary

The BadT2I study demonstrates that backdoor attacks on T2I models are both feasible and effective. It serves as a warning for practical deployment and underlines the importance of data security and model auditing.

Future Research Directions

  • More Concealed Attacks: Semantic triggers instead of lexical triggers
  • Automated Detection: Machine learning methods to identify backdoor behaviors
  • Robustness Training: Adversarial training to improve model attack resistance
  • Multimodal Defense: Defense mechanisms targeting text-image joint features

This study is an important step towards safer AI systems, driving the community to pay attention to T2I model security.