
Metaphor-based Jailbreaking Attacks on Text-to-Image Models

Authors: Chenyu Zhang, Yiwen Ma, Lanjun Wang, Wenhui Li, Yi Tu, An-An Liu

Published: 2025-12-06

arXiv ID: 2512.10766v1

Added to Library: 2025-12-12 03:02 UTC

πŸ“„ Abstract

Text-to-image (T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker knows the type of deployed defenses, which limits their effectiveness against unknown or diverse defense mechanisms. In this work, we introduce MJA, a metaphor-based jailbreaking attack method inspired by the Taboo game, aiming to effectively and efficiently attack diverse defense mechanisms without prior knowledge of their type by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (MLAG) and an adversarial prompt optimization module (APO). MLAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, MLAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Extensive experiments on T2I models with various external and internal defense mechanisms demonstrate that MJA outperforms six baseline methods, achieving stronger attack performance while using fewer queries. Code is available at https://github.com/datar001/metaphor-based-jailbreaking-attack.
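
To make the two-module pipeline described above concrete, the sketch below shows one way the structure could be organized: an MLAG-style stage that chains metaphor retrieval, context matching, and prompt writing across several LLM calls, followed by an APO-style selection step. All names here (query_llm, MetaphorAgent-style roles, toy_surrogate) are illustrative placeholders assumed for this sketch, not the authors' implementation; the real system uses LLM-based agents, a trained surrogate model, and a T2I model under test.

```python
# Minimal structural sketch of the two-module pipeline described in the abstract.
# query_llm is a stub standing in for an LLM-based agent; the surrogate below is a
# toy placeholder, whereas the paper trains a model to predict attack results.
import random
from dataclasses import dataclass


def query_llm(instruction: str) -> str:
    """Stub for a call to an LLM-based agent."""
    return f"[LLM output for: {instruction}]"


@dataclass
class AdversarialPrompt:
    metaphor: str
    context: str
    text: str


def mlag_generate(target_concept: str, n_candidates: int = 8) -> list:
    """Multi-agent generation: metaphor retrieval -> context matching -> prompt writing."""
    candidates = []
    for _ in range(n_candidates):
        metaphor = query_llm(f"Suggest a metaphor that indirectly evokes '{target_concept}'.")
        context = query_llm(f"Propose a benign scene that fits the metaphor: {metaphor}")
        prompt = query_llm(f"Write an image prompt combining '{metaphor}' with '{context}'.")
        candidates.append(AdversarialPrompt(metaphor, context, prompt))
    return candidates


def apo_select(candidates: list, surrogate) -> AdversarialPrompt:
    """Adversarial prompt optimization: rank candidates by predicted attack success."""
    return max(candidates, key=lambda c: surrogate(c.text))


# Toy surrogate scoring function used only to make the sketch runnable.
toy_surrogate = lambda text: random.random()

best = apo_select(mlag_generate("a sensitive concept"), toy_surrogate)
print(best.text)
```

The decomposition into metaphor, context, and final prompt mirrors the Taboo-game intuition in the abstract: the target concept is never named directly, but the combination of metaphor and context is intended to steer the T2I model toward it.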

πŸ” Key Points

  • Introduction of MJA, a metaphor-based jailbreaking attack on text-to-image (T2I) models inspired by the Taboo game, designed to bypass diverse defense mechanisms without prior knowledge of which defense is deployed.
  • An LLM-based multi-agent generation module (MLAG) that decomposes adversarial prompt creation into metaphor retrieval, context matching, and prompt generation, coordinating three LLM-based agents to produce diverse metaphor-based prompts.
  • An adversarial prompt optimization module (APO) that trains a surrogate model to predict attack results and applies an acquisition strategy to adaptively pick the most promising prompts, improving query efficiency (see the sketch after this list).
  • Experiments on T2I models protected by various external and internal defenses show that MJA outperforms six baseline attack methods while using fewer queries.
  • Code is released publicly, supporting reproduction and red-teaming use.
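
The abstract does not specify the surrogate architecture or the exact acquisition function, so the loop below is only a hedged illustration of the general pattern it describes: predict which candidate prompt is most likely to succeed, query the T2I system with it, update the surrogate on the observed outcome, and stop early on success. The function names and the "pick the highest predicted score" rule are assumptions made for this sketch.

```python
# Hedged sketch of an APO-style adaptive query loop. The scoring rule, refit
# strategy, and callback signatures are assumptions; the paper's acquisition
# strategy may differ in detail.
from typing import Callable, Optional


def adaptive_attack(
    candidates: list,
    query_t2i: Callable[[str], bool],     # True if the prompt bypasses the defense
    predict: Callable[[str], float],      # surrogate's predicted success score
    update: Callable[[str, bool], None],  # refit the surrogate on the new observation
    budget: int = 10,
) -> Optional[str]:
    remaining = list(candidates)
    for _ in range(min(budget, len(remaining))):
        # Acquisition step: pick the unqueried prompt the surrogate rates highest.
        best = max(remaining, key=predict)
        remaining.remove(best)
        success = query_t2i(best)
        update(best, success)             # surrogate learns from the real outcome
        if success:
            return best                   # stop at the first success to save queries
    return None
```

Stopping at the first successful prompt is what drives the query-efficiency claim in the abstract: the surrogate filters candidates offline so that only the most promising ones are ever sent to the defended T2I model.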

πŸ’‘ Why This Paper Matters

This paper matters because it shows that the safety filters and alignment mechanisms deployed around text-to-image models can be bypassed by indirect, metaphorical phrasing, and that this can be done without knowing which defense is in place. By combining LLM-based multi-agent prompt generation with a surrogate-guided optimization loop, MJA achieves stronger attack performance with fewer queries than prior methods, exposing a practical gap between current T2I defenses and realistic black-box attackers. The released code makes the attack reproducible for red-teaming and for evaluating new defenses.

🎯 Why It's Interesting for AI Security Researchers

The findings are relevant to AI security researchers because T2I models are widely deployed behind external and internal safety mechanisms, and a defense-agnostic, query-efficient jailbreak undermines assumptions built into existing attack and defense evaluations. The metaphor-based strategy also highlights a class of semantic evasions that keyword filters and prompt classifiers handle poorly, motivating defenses that reason about implied rather than literal content. The method and its evaluation across diverse defenses offer a useful baseline for future red-teaming and for designing more robust T2I safety pipelines.

πŸ“š Read the Full Paper