JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

Authors: Haolun Zheng, Yu He, Tailun Chen, Shuo Shao, Zhixuan Chu, Hongbin Zhou, Lan Tao, Zhan Qin, Kui Ren

Published: 2026-03-22

arXiv ID: 2603.21208v1

Added to Library: 2026-03-24 03:03 UTC

Red Teaming

📄 Abstract

Text-to-image (T2I) models such as Stable Diffusion and DALL·E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale, costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreaking as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.

🔍 Key Points

  • Introduction of JANUS, a lightweight framework for jailbreaking T2I models by optimizing structured prompt distributions under black-box end-to-end rewards.
  • JANUS replaces high-capacity generators with a low-dimensional mixing policy, greatly increasing scalability and efficiency in bypassing safety filters.
  • Demonstrated substantial improvements in attack success rates, raising ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo.
  • The proposed methodology exposes structural weaknesses in existing T2I safety mechanisms and highlights the need for more robust defenses against adversarial prompts.
  • Empirical evidence across multiple T2I models shows that JANUS can evade both text and image safety filters effectively, indicating its potential as a tool for AI security evaluations.
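The core idea above, replacing a heavyweight prompt generator with a low-dimensional mixing policy optimized against a black-box end-to-end reward, can be illustrated with a toy sketch. This is not the authors' algorithm; `sample_a`, `sample_b`, the reward interface, and the finite-difference update are all illustrative assumptions standing in for the paper's two semantically anchored prompt distributions and its black-box reward from the T2I pipeline plus safety filters.

```python
import random

def optimize_mixing_weight(sample_a, sample_b, reward,
                           steps=50, pop=8, lr=0.2, seed=0):
    """Toy sketch of black-box optimization of a single mixing weight.

    alpha in [0, 1] controls how often prompts are drawn from
    distribution A vs. distribution B. Only scalar reward evaluations
    are used (no gradients through the model), mimicking an
    end-to-end black-box objective.

    sample_a, sample_b: callables returning a prompt string.
    reward: callable mapping a prompt to a scalar score.
    """
    rng = random.Random(seed)
    alpha = 0.5  # start with an even mix of the two distributions

    def avg_reward(a):
        # Monte Carlo estimate of expected reward under mixing weight a.
        total = 0.0
        for _ in range(pop):
            prompt = sample_a() if rng.random() < a else sample_b()
            total += reward(prompt)
        return total / pop

    for _ in range(steps):
        # Finite-difference estimate of d(reward)/d(alpha),
        # the only "policy parameter" being optimized.
        eps = 0.1
        hi = min(1.0, alpha + eps)
        lo = max(0.0, alpha - eps)
        grad = (avg_reward(hi) - avg_reward(lo)) / (hi - lo)
        # Gradient-ascent step, clipped back into [0, 1].
        alpha = min(1.0, max(0.0, alpha + lr * grad))
    return alpha

# Usage with a stand-in reward that prefers distribution A: the
# optimizer should push alpha toward sampling mostly from A.
alpha = optimize_mixing_weight(
    sample_a=lambda: "prompt-from-A",
    sample_b=lambda: "prompt-from-B",
    reward=lambda p: 1.0 if "A" in p else 0.0,
)
```

Because the policy is a single scalar (rather than the weights of a generator network), each update needs only a handful of reward queries, which is the kind of scalability argument the key points above make for the real framework.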

💡 Why This Paper Matters

The paper presents JANUS as a pivotal advancement in understanding and attacking the vulnerabilities within text-to-image generation models. By providing a novel approach to crafting adversarial prompts efficiently, it not only raises the success rate of jailbreak attacks but also sheds light on the underlying weaknesses of existing model safety mechanisms. Such contributions are crucial for researchers aiming to strengthen AI security frameworks before adversarial abuse proliferates.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because of its focus on circumventing safety measures in widely deployed T2I systems. It highlights the critical need for stronger defenses and serves as a countermeasure-analysis tool, providing insight into how current models can be exploited. Understanding and mitigating such vulnerabilities is essential for the responsible deployment of AI systems.

📚 Read the Full Paper