← Back to Library

Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is

Authors: Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda

Published: 2025-07-29

arXiv ID: 2507.21820v1

Added to Library: 2025-07-30 05:02 UTC

Red Teaming

📄 Abstract

Despite significant advancements in alignment and content moderation, large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. Unlike traditional adversarial examples requiring expert knowledge, many of today's jailbreaks are low-effort, high-impact crafted by everyday users with nothing more than cleverly worded prompts. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms through techniques such as multi-turn narrative escalation, lexical camouflage, implication chaining, fictional impersonation, and subtle semantic edits. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models, grounded in empirical case studies across popular APIs. Our analysis reveals that every stage of the moderation pipeline, from input filtering to output validation, can be bypassed with accessible strategies. We conclude by highlighting the urgent need for context-aware defenses that reflect the ease with which these jailbreaks can be reproduced in real-world settings.

🔍 Key Points

  • The paper identifies various prompt-based strategies, termed 'jailbreaks', that users can employ to bypass the moderation mechanisms of large language models (LLMs) and text-to-image systems (T2Is).
  • It proposes a unified taxonomy of these prompt-based jailbreak strategies, highlighting their effectiveness across different models and modes of interaction (text and image).
  • The analysis reveals specific vulnerabilities in moderation pipelines, demonstrating how multi-turn dialog and contextual framing can exploit these weaknesses to generate inappropriate or harmful outputs.
  • The authors present empirical case studies showing high success rates of various jailbreak methods, emphasizing that these techniques require little to no technical expertise, making them accessible to the general public.
  • The conclusion emphasizes the urgent need for context-aware defenses in moderation systems to address these emerging threats effectively.

💡 Why This Paper Matters

This paper is highly relevant as it uncovers the serious vulnerabilities in current AI moderation systems, which can be exploited by malicious actors through simple prompt manipulation. Its findings underscore the necessity for more sophisticated and adaptive safety mechanisms to prevent misuse of AI in high-stakes environments, emphasizing the importance of continuous improvement in model safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper significant as it not only details novel methods and tactics for jailbreaks but also highlights the practical implications of these exploits on LLMs and T2Is. It presents empirical evidence of the effectiveness of these strategies, showcasing the gap between current defense mechanisms and the creative methods employed by users, thus guiding future research directions on AI safety and robustness.

📚 Read the Full Paper