An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

Authors: Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu

Published: 2025-11-04

arXiv ID: 2511.02356v1

Added to Library: 2025-11-05 05:03 UTC

Red Teaming

📄 Abstract

The widespread deployment of Large Language Models (LLMs) as public-facing web services and APIs has made their security a core concern for the web ecosystem. Jailbreak attacks, one of the most significant threats to LLMs, have recently attracted extensive research. In this paper, we reveal a class of jailbreak attacks that can effectively evade current defenses: attacks that extract valuable information from failed or partially successful attempts and self-evolve through attack interactions, yielding substantial strategy diversity and adaptability. Inspired by continuous learning and modular design principles, we propose ASTRA, a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies to achieve more efficient and adaptive attacks. To enable this autonomous evolution, we design a closed-loop "attack-evaluate-distill-reuse" core mechanism that not only generates attack prompts but also automatically distills and generalizes reusable attack strategies from every interaction. To systematically accumulate and apply this attack knowledge, we introduce a three-tier strategy library that categorizes strategies as Effective, Promising, or Ineffective based on their performance scores. The strategy library not only provides precise guidance for attack generation but is also highly extensible and transferable. We conduct extensive experiments under a black-box setting; the results show that ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
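
To make the closed loop concrete, below is a minimal Python sketch of an "attack-evaluate-distill-reuse" cycle. The paper does not publish code, so every name here (`query_target`, `judge_response`, `generate_attack`, `distill_strategy`) is a hypothetical stand-in for the LLM components such a framework would wire together, not ASTRA's actual interface.

```python
# Minimal sketch of a closed-loop "attack-evaluate-distill-reuse" cycle.
# All helpers are hypothetical stand-ins, not APIs from the paper.


def query_target(prompt: str) -> str:
    """Stand-in for a black-box call to the target LLM."""
    return "I cannot help with that."  # placeholder refusal


def judge_response(response: str) -> float:
    """Stand-in for an LLM judge scoring attack success in [0, 1]."""
    return 0.0 if "cannot" in response else 1.0


def generate_attack(goal: str, strategies: list[str]) -> str:
    """Stand-in: an attacker LLM would condition on retrieved strategies."""
    hint = strategies[-1] if strategies else "no prior strategy"
    return f"[{hint}] {goal}"


def distill_strategy(prompt: str, response: str, score: float) -> str:
    """Stand-in: an LLM would abstract a reusable tactic from the exchange."""
    return f"tactic distilled from a score-{score:.1f} attempt"


def attack_loop(goal: str, rounds: int = 3) -> list[tuple[str, float]]:
    library: list[str] = []  # reuse: strategies retrieved in later rounds
    history = []
    for _ in range(rounds):
        prompt = generate_attack(goal, library)             # attack
        response = query_target(prompt)
        score = judge_response(response)                    # evaluate
        # Distill from every interaction, successful or not.
        library.append(distill_strategy(prompt, response, score))
        history.append((prompt, score))
    return history


if __name__ == "__main__":
    for prompt, score in attack_loop("benign test goal"):
        print(f"{score:.1f}  {prompt}")
```

The key design point the abstract emphasizes is that distillation happens on every interaction, so even failed attempts feed the strategy pool rather than being discarded.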

🔍 Key Points

  • The ASTRA framework autonomously discovers, retrieves, and evolves jailbreak attack strategies, achieving an average Attack Success Rate (ASR) of 82.7% under a black-box setting.
  • The closed-loop "attack-evaluate-distill-reuse" mechanism enables continuous learning and strategy refinement, improving the adaptability and efficiency of jailbreak attacks.
  • A three-tier strategy library (Effective, Promising, Ineffective) guides attack generation toward robust attack patterns; see the sketch after this list.
  • Extensive experiments on various LLMs demonstrate ASTRA's superior performance over existing baselines and the strong cross-model transferability of its strategies.
  • The research offers significant insight into LLM vulnerabilities, informing the design of stronger defense mechanisms.
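
The three-tier library can be pictured as a score-bucketed store. The tier names (Effective, Promising, Ineffective) come from the paper; the thresholds, class layout, and retrieval policy in this sketch are illustrative assumptions, not ASTRA's published design.

```python
# Sketch of a three-tier strategy library; thresholds are assumed values.
from dataclasses import dataclass


@dataclass
class Strategy:
    description: str
    score: float  # judge-assigned performance score in [0, 1]


class TieredStrategyLibrary:
    def __init__(self, effective_min: float = 0.7, promising_min: float = 0.4):
        # Assumed cutoffs; the paper scores strategies but these exact
        # threshold values are illustrative.
        self.effective_min = effective_min
        self.promising_min = promising_min
        self.tiers: dict[str, list[Strategy]] = {
            "Effective": [], "Promising": [], "Ineffective": []
        }

    def add(self, strategy: Strategy) -> str:
        """File a distilled strategy into its tier by performance score."""
        if strategy.score >= self.effective_min:
            tier = "Effective"
        elif strategy.score >= self.promising_min:
            tier = "Promising"
        else:
            tier = "Ineffective"
        self.tiers[tier].append(strategy)
        return tier

    def retrieve(self, k: int = 3) -> list[Strategy]:
        # Prefer Effective strategies but keep Promising ones in the pool,
        # so partially successful attempts still guide the next round.
        pool = self.tiers["Effective"] + self.tiers["Promising"]
        return sorted(pool, key=lambda s: s.score, reverse=True)[:k]


if __name__ == "__main__":
    lib = TieredStrategyLibrary()
    for desc, score in [("roleplay framing", 0.9),
                        ("payload splitting", 0.5),
                        ("direct request", 0.1)]:
        print(lib.add(Strategy(desc, score)), "<-", desc)
    print([s.description for s in lib.retrieve()])
```

Keeping the Ineffective tier around, rather than deleting low scorers, is what would let such a library steer generation away from known dead ends as well as toward known wins.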

💡 Why This Paper Matters

The paper presents ASTRA as a substantial advance in automated jailbreak attacks against LLMs, emphasizing its self-learning design and empirical effectiveness. Given the increasing reliance on LLMs within the web ecosystem, understanding and mitigating jailbreak vulnerabilities through frameworks like ASTRA is crucial for strengthening digital security practices. The insights gained from this research can inform future AI safety mechanisms designed to counter evolving threats.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly valuable to AI security researchers because it not only unveils new vulnerabilities within LLMs but also offers a sophisticated, automated approach to exploring these vulnerabilities. By demonstrating how jailbreak strategies can be continuously refined, the research provides a blueprint for understanding and improving model safety protocols, essential for those involved in AI ethics, safety research, and model development.

📚 Read the Full Paper: https://arxiv.org/abs/2511.02356v1