← Back to Library

Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models

Authors: Pavlos Ntais

Published: 2025-10-24

arXiv ID: 2510.22085v1

Added to Library: 2025-10-28 04:01 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks that exploit contextual framing to bypass safety mechanisms, posing significant risks in cybersecurity applications. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts in a one-shot manner. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process, enabling proactive vulnerability assessment in AI-driven security systems. Developed for the OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench, achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out test set of 200 items. Cross-model evaluation reveals significant variation in vulnerability patterns: our attacks achieve 66.5% ASR against GPT-4, 79.5% on Llama-3 and 33.0% against Gemini 2.5 Flash, demonstrating both broad applicability and model-specific defensive strengths in cybersecurity contexts. This represents a 54x improvement over direct prompting (1.5% ASR) and demonstrates systematic vulnerabilities in current safety alignment approaches. Our analysis reveals that technical domains (Cybersecurity: 93% ASR) and deception-based attacks (Fraud: 87.8% ASR) are particularly vulnerable, highlighting threats to AI-integrated threat detection, malware analysis, and secure systems, while physical harm categories show greater resistance (55.6% ASR). We employ automated harmfulness evaluation using Claude Sonnet 4, cross-validated with human expert assessment, ensuring reliable and scalable evaluation for cybersecurity red-teaming. Finally, we analyze failure mechanisms and discuss defensive strategies to mitigate these vulnerabilities in AI for cybersecurity.

🔍 Key Points

  • Automated Jailbreak Generation: The study successfully demonstrates an automated methodology to generate narrative-based jailbreak prompts, achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B, significantly enhancing adversarial capabilities compared to previous manual approaches.
  • Cross-Model Vulnerability Analysis: The research provides a comprehensive vulnerability mapping across multiple large language models (LLMs), identifying systematic failings in safety mechanisms, which vary significantly among different models.
  • Evaluation Framework: A novel hybrid evaluation approach is introduced, combining automated AI scoring with human expert validation, ensuring more accurate assessments of jailbreak prompts and reinforcing the findings with empirical data.
  • Proactive Defense Insights: The study highlights systemic vulnerabilities in existing safety alignment approaches, emphasizing the need for context-aware safety systems and suggesting multiple strategies for defending against sophisticated adversarial prompts.
  • Attack Patterns: The paper elucidates distinct attack patterns—creative misdirection, functional utility, and authoritative context—demonstrating how contextual framing can be exploited to bypass safety protocols.

💡 Why This Paper Matters

This paper is crucial in addressing the vulnerabilities of large language models to adversarial attacks, particularly through the lens of contextual manipulation. By systematically automating the discovery of narrative-based jailbreaks, the authors have transformed how adversarial prompt crafting can be approached, making it replicable and scalable. The findings urge the AI community to reconsider current safety alignment strategies, advocating for innovations in defensive architectures and training methodologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant for AI security researchers as it uncovers critical vulnerabilities in existing safety mechanisms of large language models, demonstrating how these vulnerabilities can be systematically exploited through advanced attack methodologies. It provides valuable insights into the nature of these attacks while also proposing fundamental changes to enhance AI safety protocols, making it a vital contribution towards securing AI systems.

📚 Read the Full Paper