SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models

Authors: Hanbin Hong, Shuya Feng, Nima Naderloui, Shenao Yan, Jingyu Zhang, Biying Liu, Ali Arastehfard, Heqing Huang, Yuan Hong

Published: 2025-10-17

arXiv ID: 2510.15476v2

Added to Library: 2025-10-22 02:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have rapidly become integral to real-world applications, powering services across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs. Despite intense research into both attack and defense techniques, the field remains fragmented: definitions, threat models, and evaluation criteria vary widely, impeding systematic progress and fair comparison. In this Systematization of Knowledge (SoK), we address these challenges by (1) proposing a holistic, multi-level taxonomy that organizes attacks, defenses, and vulnerabilities in LLM prompt security; (2) formalizing threat models and cost assumptions into machine-readable profiles for reproducible evaluation; (3) introducing an open-source evaluation toolkit for standardized, auditable comparison of attacks and defenses; (4) releasing JAILBREAKDB, the largest annotated dataset of jailbreak and benign prompts to date (available at https://huggingface.co/datasets/youbin2014/JailbreakDB); and (5) presenting a comprehensive evaluation platform and leaderboard of state-of-the-art methods (to be released soon). Our work unifies fragmented research, provides rigorous foundations for future studies, and supports the development of robust, trustworthy LLMs suitable for high-stakes deployment.
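
As a quick way to get started with the released dataset, here is a minimal sketch of pulling JAILBREAKDB from the Hugging Face Hub with the `datasets` library. The repository ID comes from the paper; the split and column names used below ("train", "label") are assumptions about the schema, not documented fields, so inspect the loaded dataset before relying on them.

```python
# Minimal sketch: load JailbreakDB from the Hugging Face Hub.
# Repository ID is from the paper; split/column names are assumptions.
from datasets import load_dataset

ds = load_dataset("youbin2014/JailbreakDB", split="train")
print(ds)  # inspect the actual schema (features, num_rows) first

# Hypothetical filtering step, assuming a binary "label" column that
# distinguishes jailbreak prompts from benign ones.
jailbreaks = ds.filter(lambda row: row["label"] == 1)
print(f"{len(jailbreaks)} jailbreak prompts out of {len(ds)} total")
```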

🔍 Key Points

  • Holistic taxonomy for organizing attacks, defenses, and vulnerabilities in LLM prompt security, providing a structured framework for the field.
  • Formalization of threat models and cost assumptions into machine-readable profiles, enabling reproducible evaluation of security methods (see the sketch after this list).
  • Introduction of an open-source evaluation toolkit, enhancing the standardization and comparability of different security techniques.
  • Release of JAILBREAKDB, the largest annotated dataset of jailbreak and benign prompts to date, aiding empirical research in prompt security.
  • Unified evaluation platform presenting insights into the effectiveness of various state-of-the-art attacks and defenses.
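
To make the idea of a machine-readable threat-model profile concrete, below is a hypothetical sketch in Python. The field names (access, query_budget, objective) are illustrative assumptions and do not reflect the paper's actual profile schema; the point is that such a profile can be serialized and audited alongside evaluation results.

```python
# Hypothetical threat-model profile, in the spirit of the paper's
# formalization. Field names are illustrative assumptions, not the
# paper's actual schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class ThreatProfile:
    name: str
    access: str        # e.g. "black-box" or "white-box"
    query_budget: int  # maximum queries the attacker may issue
    objective: str     # e.g. "elicit harmful completion"

profile = ThreatProfile(
    name="black-box-jailbreak",
    access="black-box",
    query_budget=100,
    objective="elicit harmful completion",
)
print(json.dumps(asdict(profile), indent=2))  # serialize for auditing
```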

💡 Why This Paper Matters

The paper offers a significant step forward in addressing the fragmented landscape of prompt security in LLMs by providing a systematic approach to categorizing and evaluating threats and defenses. Its contributions, including a comprehensive taxonomy, a robust evaluation framework, and a large-scale annotated dataset, are foundational for future research aimed at enhancing the safety and reliability of language models in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers as it not only identifies and categorizes various vulnerabilities within LLMs but also equips the community with tools and datasets to robustly evaluate these vulnerabilities. By establishing a systematic framework for prompt security, it paves the way for more effective attack and defense mechanisms, fostering increased understanding and advancement in the security of AI applications, especially as they are integrated into high-stakes environments.

📚 Read the Full Paper

https://arxiv.org/abs/2510.15476v2