Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Authors: Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda

Published: 2026-04-02

arXiv ID: 2604.01888v1

Added to Library: 2026-04-03 02:00 UTC

Red Teaming

📄 Abstract

Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work, we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques, including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories, we observe an attack success rate (ASR) of up to 74.47%.
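
The headline ASR figure is a simple ratio: for each model and attack category, the fraction of adversarial prompts that both pass moderation and yield the restricted image. The sketch below shows how such a figure is typically tabulated; the record fields (`model`, `category`, `bypassed_filter`, `generated_restricted`) are illustrative assumptions, not the paper's actual evaluation schema.

```python
from collections import defaultdict

def attack_success_rate(records):
    """Tabulate ASR per (model, attack category).

    Each record is a dict with illustrative fields (assumed, not the
    paper's schema):
      model                -- text-to-image system under test
      category             -- strategy, e.g. "artistic reframing"
      bypassed_filter      -- True if the prompt passed moderation
      generated_restricted -- True if the output was judged restricted
    """
    totals, successes = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["model"], r["category"])
        totals[key] += 1
        # A success requires evading the filter AND producing the
        # restricted content; a blocked or benign result counts as a miss.
        if r["bypassed_filter"] and r["generated_restricted"]:
            successes[key] += 1
    return {k: successes[k] / totals[k] for k in totals}

# The paper's "up to 74.47%" reads as the maximum over all
# (model, category) cells of such a table:
# print(max(attack_success_rate(records).values()))  # e.g. 0.7447
```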

🔍 Key Points

  • Introduction of a taxonomy of prompt-based jailbreak strategies that exploit weaknesses in text-to-image safety filters.
  • Demonstration that low-effort linguistic modifications can yield high attack success rates (up to 74.47%) across state-of-the-art generative models.
  • Identification of vulnerabilities in current moderation pipelines, which fail to adequately capture implicit intent and semantic context in user prompts (illustrated by the sketch after this list).
  • Evidence that prompt-based attacks are accessible to non-expert users and can easily be automated, lowering the barrier for misuse.
  • Empirical evaluation of multiple text-to-image models, revealing consistent vulnerabilities across different architectures and deployment contexts.
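
To make the moderation gap in the third point concrete, the sketch below contrasts a surface-level keyword filter with an embedding-based check that compares a prompt's meaning against unsafe-concept descriptions. This is a hypothetical illustration of the defense direction the findings point toward, not a method from the paper; the blocklist, concept texts, model name, and 0.3 threshold are all assumed for the example.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed, toy policy artifacts -- a real pipeline would use curated,
# policy-specific sets and a tuned threshold.
BLOCKLIST = {"blood", "gore", "weapon"}
UNSAFE_CONCEPTS = [
    "graphic violence against a person",
    "realistic depiction of someone being harmed with a weapon",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
concept_vecs = model.encode(UNSAFE_CONCEPTS, convert_to_tensor=True)

def keyword_filter(prompt: str) -> bool:
    """Surface-level check: trips only on explicit blocklisted tokens."""
    return any(word in prompt.lower() for word in BLOCKLIST)

def semantic_filter(prompt: str, threshold: float = 0.3) -> bool:
    """Flags prompts whose meaning is close to an unsafe concept,
    even when no blocklisted word appears in the text."""
    vec = model.encode(prompt, convert_to_tensor=True)
    return bool(util.cos_sim(vec, concept_vecs).max().item() > threshold)

# A reframed prompt (e.g. one using the paper's "artistic reframing" or
# "material substitution" strategies) can avoid every blocklisted token
# and slip past keyword_filter, while semantic_filter may still flag it
# because the prompt's embedding stays near the unsafe concept.
```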

💡 Why This Paper Matters

This paper is highly relevant because it uncovers significant security vulnerabilities in text-to-image generative models, showing that existing safety mechanisms fail to detect even low-effort attempts to bypass controls. The findings underscore the need for safety mechanisms that account for the subtlety of natural language and the context in which prompts are framed.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper valuable because it offers detailed insight into the evolving threat landscape around generative models. The study demonstrates practical methodologies for security assessment, illustrates the real-world implications of these vulnerabilities, and makes the case for developing more robust safety measures.
