Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Authors: Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi

Published: 2025-10-14

arXiv ID: 2510.13893v1

Added to Library: 2025-10-17 04:01 UTC

Red Teaming

📄 Abstract

Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families: impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.
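
For illustration only, here is a minimal sketch of how the seven top-level families named in the abstract could be encoded as a machine-readable structure for annotation or taxonomy-guided prompting. The family names come from the abstract; the one-line descriptions are paraphrased assumptions, not the authors' definitions, and the 50 leaf strategies are omitted.

```python
# Sketch only: the seven top-level families named in the abstract, encoded as a
# dictionary for annotation or prompting. Descriptions are illustrative
# paraphrases, not the authors' definitions; the 50 leaf strategies are omitted.
JAILBREAK_FAMILIES = {
    "impersonation": "adopting a persona or role to legitimize a harmful request",
    "persuasion": "social-engineering pressure such as appeals to authority or emotion",
    "privilege_escalation": "claiming elevated permissions, developer modes, or overrides",
    "cognitive_overload": "burying intent in long, layered, or contradictory instructions",
    "obfuscation": "hiding intent behind encodings, translations, or indirection",
    "goal_conflict": "pitting safety rules against other stated objectives",
    "data_poisoning": "injecting adversarial content into the model's context or data",
}


def taxonomy_block() -> str:
    """Render the families as a bullet list suitable for a detection prompt."""
    return "\n".join(f"- {name}: {desc}" for name, desc in JAILBREAK_FAMILIES.items())
```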

🔍 Key Points

  • Developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, categorizing them into seven families such as impersonation, persuasion, and data poisoning, which sharpens understanding of the diverse landscape of jailbreaking attacks.
  • Conducted a structured red-teaming challenge with 48 participants that produced a novel Italian dataset of 1364 multi-turn adversarial dialogues, providing a rich resource for future research and model testing.
  • Demonstrated the effectiveness of taxonomy-guided prompting for automatic detection of jailbreak attempts, raising detection accuracy from 65.9% to 78.0% when the taxonomy was supplied to the detector (a minimal sketch of this idea follows the list below).
  • Analyzed the prevalence and success rates of the different attack strategies, revealing insights into how model vulnerabilities are exploited in practice and the conditions under which specific techniques succeed.
  • Outlined future research directions, including plans for further analyzing incremental attack patterns and updating the taxonomy to remain relevant as new techniques emerge.
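
As referenced above, the following is a minimal sketch of what taxonomy-guided prompting for jailbreak detection could look like. The prompt wording, the `call_llm` callable, and the JAILBREAK/BENIGN output convention are assumptions for illustration and do not reproduce the authors' benchmark setup; the `taxonomy` argument can be built with the `taxonomy_block()` helper from the earlier sketch.

```python
from typing import Callable, Dict, List


def detect_jailbreak(
    dialogue: List[Dict[str, str]],
    taxonomy: str,
    call_llm: Callable[[str], str],
) -> str:
    """Classify a multi-turn dialogue as JAILBREAK or BENIGN with taxonomy guidance.

    `dialogue` is a list of {"role": ..., "content": ...} turns, `taxonomy` is a
    textual listing of strategy families (e.g. the output of taxonomy_block()),
    and `call_llm` is a placeholder for whatever chat/completion client is used.
    """
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in dialogue)
    prompt = (
        "You are a safety classifier. Jailbreak attempts commonly use one of "
        "the following strategy families:\n"
        f"{taxonomy}\n\n"
        "Read the dialogue below. Answer with the single word JAILBREAK or "
        "BENIGN; if JAILBREAK, also name the most likely family.\n\n"
        f"Dialogue:\n{transcript}"
    )
    return call_llm(prompt)
```

In the paper's benchmark, supplying the taxonomy in the detection prompt raised accuracy from 65.9% to 78.0%; the sketch above only illustrates the general shape of such a prompt, not the exact wording the authors evaluated.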

💡 Why This Paper Matters

This paper is significant because it addresses jailbreaking, a pressing safety risk for Large Language Models (LLMs). By establishing a new, detailed taxonomy and an associated dataset, the authors provide critical tools for future research aimed at safeguarding AI applications. The findings not only enhance our understanding of adversarial prompting but also support building models that are more robust to such attacks, which is essential for deploying LLMs in sensitive domains.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper essential because it not only provides a robust framework for classifying jailbreaking techniques but also empirically validates the effectiveness of these classifications in improving detection systems. Given the growing complexity and sophistication of adversarial attacks, the insights gained from the structured challenge and the taxonomy could inform the design of more resilient AI systems and contribute to the development of better safety mechanisms against emerging threats.

📚 Read the Full Paper