
Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience

Authors: Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Bin Ji, Jun Ma, Xiaodong Liu, Jing Wang, Feilong Bao, Jianfeng Zhang, Baosheng Wang, Jie Yu

Published: 2025-08-25

arXiv ID: 2508.19292v1

Added to Library: 2025-08-28 04:02 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) generate human-aligned content under certain safety constraints. However, the known technique of "jailbreak prompts" can circumvent safety-alignment measures and induce LLMs to output malicious content. Research on jailbreaking helps identify vulnerabilities in LLMs and guides the development of robust security frameworks. To circumvent the problem of attack templates becoming obsolete as models evolve, existing methods adopt iterative mutation and dynamic optimization to enable more automated jailbreak attacks. However, these methods face two challenges, inefficiency and repetitive optimization, because they overlook the value of past attack experiences. To better integrate past attack experiences into current jailbreak attempts, we propose **JailExpert**, an automated jailbreak framework that is the first to achieve a formal representation of experience structure, group experiences based on semantic drift, and support dynamic updating of the experience pool. Extensive experiments demonstrate that JailExpert significantly improves both attack effectiveness and efficiency. Compared to current state-of-the-art black-box jailbreak methods, JailExpert achieves an average 17% increase in attack success rate and a 2.7x improvement in attack efficiency. Our implementation is available at https://github.com/xiZAIzai/JailExpert

πŸ” Key Points

  • Introduction of JailExpert: This paper presents JailExpert, an automated jailbreak framework that integrates past attack experiences to significantly improve both the effectiveness and efficiency of jailbreak attempts on large language models (LLMs).
  • Experience Formalization: JailExpert formalizes jailbreak experiences through a structured format that encompasses detailed information about attacks, including mutation strategies, templates, initial instructions, and success/failure counts, facilitating dynamic adaptability.
  • Jailbreak Semantic Drift: The concept of 'jailbreak semantic drift' is introduced, allowing for the grouping of experiences based on the semantic differences between instructions and complete jailbreak prompts, enhancing the efficiency of the attack strategy.
  • Comprehensive Experimental Validation: Extensive experiments demonstrate JailExpert’s superiority, achieving an average increase of 17% in attack success rate and 2.7 times improvement in attack efficiency compared to existing methods, across a range of state-of-the-art LLMs.
  • Robustness Against Defenses: JailExpert consistently outperformed existing defense methods, indicating the pressing need for more effective security that evolves alongside jailbreaking techniques.

💡 Why This Paper Matters

This paper is critical in the context of AI security as it systematically addresses the vulnerabilities of large language models through innovative methods in jailbreak attacks. By leveraging past attack experiences, JailExpert not only advances the understanding of attack methodologies but also raises awareness regarding the robustness of current model defenses, indicating a pathway for future research in AI safety and resilience.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant for AI security researchers because it highlights serious vulnerabilities within widely-used LLMs, presenting a sophisticated framework to exploit them. The novel approaches outlined in JailExpert, particularly its focus on dynamic adaptability and historical experience integration, provide valuable insights to enhance the security frameworks of AI models. Additionally, the challenges posed by evolving defense mechanisms prompt further exploration and innovation in AI security strategies.

📚 Read the Full Paper