JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

Authors: Haolun Zheng, Yu He, Tailun Chen, Shuo Shao, Zhixuan Chu, Hongbin Zhou, Lan Tao, Zhan Qin, Kui Ren

Published: 2026-03-22

arXiv ID: 2603.21208v2

Added to Library: 2026-03-27 03:02 UTC

Red Teaming

📄 Abstract

Text-to-image (T2I) models such as Stable Diffusion and DALL·E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
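The abstract reports results as ASR-8. As a minimal sketch, assuming ASR-k counts a target prompt as successfully attacked when at least one of its first k attempts succeeds, the metric can be computed from a per-prompt list of attempt outcomes. The function name `asr_at_k` and the boolean success criterion are illustrative assumptions, not the paper's evaluation code; the paper's exact success judgment (filter bypass, NSFW classifier, CLIP alignment) may differ.

```python
from typing import Sequence


def asr_at_k(attempts: Sequence[Sequence[bool]], k: int = 8) -> float:
    """Fraction of target prompts with at least one success in the first k attempts.

    attempts[i][j] is True if attempt j on target prompt i was judged a success.
    The success criterion itself is an assumption left to the caller.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not attempts:
        return 0.0
    # A prompt counts once, no matter how many of its k attempts succeed.
    hits = sum(1 for trials in attempts if any(trials[:k]))
    return hits / len(attempts)


# Four hypothetical target prompts, eight attempts each.
results = [[False] * 7 + [True], [False] * 8, [True] + [False] * 7, [False] * 8]
print(asr_at_k(results, k=8))  # 0.5
print(asr_at_k(results, k=1))  # 0.25
```

Because ASR-k only asks whether any of k tries succeeds, it rewards a method that can cheaply resample from a good distribution, which fits a distribution-level attack like the one described here.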

🔍 Key Points

Introduction of JANUS, a two-stage framework that bypasses safety filters in T2I models via distribution optimization rather than relying on large language models (LLMs).
Demonstrated higher attack success rates (ASR) than state-of-the-art methods across open-source and commercial T2I models, raising ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo.
Utilizes a dual-Gaussian modeling approach to create semantically anchored distributions, maintaining semantic integrity while promoting adversarial exploration.
Ablation studies confirm that the combination of dual distributions and a dynamic mixing policy is crucial for robust jailbreak performance, outperforming existing methods.
Highlights significant vulnerabilities in current T2I safety mechanisms, urging the need for enhanced defenses.
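The key points describe a low-dimensional mixing policy over two semantically anchored Gaussian distributions. As a generic sketch (not the authors' implementation), the snippet below shows only the sampling side of such a mixture: a single weight `w` decides how often a sample is drawn from each of two isotropic Gaussians. The function name, the isotropic covariance, and the plain vector space are assumptions for illustration. How JANUS parameterizes its anchors and how it updates the mixing policy from the end-to-end reward are not described in this summary and are deliberately not modeled.

```python
import random
from typing import List, Sequence, Tuple


def sample_mixture(
    mu_a: Sequence[float],
    mu_b: Sequence[float],
    sigma: float,
    w: float,
    n: int,
    seed: int = 0,
) -> Tuple[List[List[float]], List[bool]]:
    """Draw n samples from w * N(mu_a, sigma^2 I) + (1 - w) * N(mu_b, sigma^2 I).

    Returns the samples and, for each one, whether it came from component A.
    """
    if len(mu_a) != len(mu_b):
        raise ValueError("anchor means must have the same dimension")
    if not 0.0 <= w <= 1.0:
        raise ValueError("mixing weight w must lie in [0, 1]")
    rng = random.Random(seed)
    samples: List[List[float]] = []
    from_a: List[bool] = []
    for _ in range(n):
        # Pick a component with probability w, then add isotropic Gaussian noise.
        pick_a = rng.random() < w
        mu = mu_a if pick_a else mu_b
        samples.append([m + sigma * rng.gauss(0.0, 1.0) for m in mu])
        from_a.append(pick_a)
    return samples, from_a


points, labels = sample_mixture([0.0, 0.0], [5.0, 5.0], sigma=0.5, w=0.3, n=1000)
print(sum(labels) / len(labels))  # close to 0.3
```

The appeal of this parameterization is that the policy controls only a handful of numbers (here just `w`) rather than the weights of a large generator, which is consistent with the "lightweight" framing in the abstract.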

💡 Why This Paper Matters

This paper provides insight into the vulnerability of T2I models to jailbreak attacks and introduces JANUS, a framework showing that optimizing a low-dimensional prompt distribution can outperform attacks that depend on large, costly RL-trained generators. By exposing operational weaknesses in existing safety mechanisms, it can guide future research toward more robust, distribution-aware defenses and safer deployment of T2I systems.

🎯 Why It's Interesting for AI Security Researchers

The findings presented in this paper are particularly significant for AI security researchers, as they expose the underlying vulnerabilities in widely used T2I models. Understanding these weaknesses can inform the development of more effective safety measures and highlight the need for continuous improvements in AI governance. Additionally, the novel methods introduced could inspire further research into optimizing model defenses against adversarial attacks.
