
NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation

Authors: Yu Xie, Chengjie Zeng, Lingyun Zhang, Yanwei Fu

Published: 2025-06-23

arXiv ID: 2506.18325v1

Added to Library: 2025-06-24 04:00 UTC

📄 Abstract

The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by "jailbreak" attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks. Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.
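
The abstract describes PromptSan-Modify only at a high level, so the following is a minimal, hypothetical Python sketch of what such an inference-time sanitization loop could look like; the classifier interface, the token-attribution heuristic, and the replacement vocabulary are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the PromptSan-Modify idea: greedily locate the token that
# contributes most to a text NSFW classifier's harm score and replace it with the
# benign candidate that lowers the score most, until the prompt passes the check.
from typing import Callable, List

def sanitize_prompt(
    tokens: List[str],
    nsfw_score: Callable[[str], float],  # text NSFW classifier; higher = more harmful (assumed interface)
    candidates: List[str],               # benign replacement vocabulary (assumed)
    threshold: float = 0.5,
    max_iters: int = 10,
) -> str:
    tokens = list(tokens)
    for _ in range(max_iters):
        base = nsfw_score(" ".join(tokens))
        if base < threshold:
            break  # prompt already passes the text NSFW check
        # Attribute harm to each token by the score drop when that token is removed.
        drops = [
            base - nsfw_score(" ".join(tokens[:i] + tokens[i + 1:]))
            for i in range(len(tokens))
        ]
        worst = max(range(len(tokens)), key=lambda i: drops[i])
        # Swap in the benign candidate that minimizes the classifier's harm score.
        tokens[worst] = min(
            candidates,
            key=lambda c: nsfw_score(" ".join(tokens[:worst] + [c] + tokens[worst + 1:])),
        )
    return " ".join(tokens)
```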

🔍 Key Points

  • Proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), which detoxifies harmful prompts for text-to-image models such as Stable Diffusion without altering the model architecture or degrading generation capability.
  • PromptSan-Modify works at inference time, using a text NSFW classifier to iteratively identify harmful tokens in the input prompt and replace them with benign alternatives.
  • PromptSan-Suffix instead trains an optimized suffix token sequence that neutralizes harmful intent while passing both text and image NSFW classifier checks (a minimal sketch of this training idea follows the list below).
  • The approach is inspired by "jailbreak" attacks on large language models, which show that subtle prompt modifications can steer model behavior; PromptSan repurposes that mechanism as a defense.
  • Extensive experiments show state-of-the-art reductions in harmful content generation across multiple metrics while balancing safety and usability.
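
As a rough illustration of the PromptSan-Suffix idea mentioned above, the sketch below optimizes a short sequence of continuous suffix embeddings against a combined text-and-image NSFW objective, assuming PyTorch; every module here (the classifiers, the stand-in generator, the pooling) is a placeholder chosen only to keep the example self-contained, and the paper's actual architecture, losses, and training procedure may differ.

```python
# Hypothetical sketch: learn a suffix of soft token embeddings that, when appended to
# a prompt, drives both a text NSFW classifier and an image NSFW classifier (applied
# to the generator's output) toward "safe". All modules below are toy stand-ins.
import torch
import torch.nn as nn

EMB_DIM, SUFFIX_LEN = 32, 8

text_nsfw = nn.Sequential(nn.Linear(EMB_DIM, 1), nn.Sigmoid())   # placeholder text NSFW classifier
image_nsfw = nn.Sequential(nn.Linear(EMB_DIM, 1), nn.Sigmoid())  # placeholder image NSFW classifier
generator = nn.Linear(EMB_DIM, EMB_DIM)                          # placeholder for the T2I pipeline
for frozen in (text_nsfw, image_nsfw, generator):
    frozen.requires_grad_(False)  # only the suffix is trained

suffix = nn.Parameter(torch.randn(SUFFIX_LEN, EMB_DIM) * 0.01)
optimizer = torch.optim.Adam([suffix], lr=1e-2)

def training_step(prompt_emb: torch.Tensor) -> torch.Tensor:
    """One optimization step on a batch of harmful-prompt embeddings with shape (B, T, D)."""
    batch = prompt_emb.size(0)
    full = torch.cat([prompt_emb, suffix.expand(batch, -1, -1)], dim=1)
    pooled = full.mean(dim=1)                    # crude pooling for the stand-in classifiers
    text_harm = text_nsfw(pooled)                # probability the sanitized text is flagged
    image_harm = image_nsfw(generator(pooled))   # probability the generated image is flagged
    loss = text_harm.mean() + image_harm.mean()  # push both harm probabilities toward zero
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

# Example usage with random stand-in prompt embeddings:
loss = training_step(torch.randn(4, 16, EMB_DIM))
```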

💡 Why This Paper Matters

This paper addresses a pressing safety problem in text-to-image generation: prompts can be crafted, or subtly modified, to elicit pornographic, violent, or discriminatory imagery from models such as Stable Diffusion. By sanitizing prompts with NSFW-classifier guidance rather than retraining or restructuring the generator, PromptSan offers a practical defense that reduces harmful outputs while preserving the model's generation capability for legitimate use.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this work relevant because it adapts the mechanics of LLM "jailbreak" attacks, in which small prompt perturbations change model behavior, into a defensive tool for text-to-image systems. The two variants illustrate complementary deployment points (inference-time token replacement versus a trained safety suffix), and the requirement that sanitized prompts pass both text and image NSFW classifier checks offers a useful template for evaluating prompt-level defenses that must reduce harmful generations without sacrificing usability.
