PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Authors: Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar

Published: 2025-10-20

arXiv ID: 2510.17947v2

Added to Library: 2025-10-23 01:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. Yet even as LLM capabilities improve, the models remain susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency, and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks, inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE achieves an ASR (as measured by StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights for understanding the importance of plan initialization, context optimization, and lifelong learning in crafting multi-turn attacks for comprehensive model vulnerability evaluation.
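
The three-phase lifecycle in the abstract can be pictured as a simple pipeline over shared attack state. The sketch below is an illustrative reconstruction only, assuming a callable `send` that queries the target model; none of the names (`AttackState`, `primer`, `planner`, `finisher`, `run_attack`) come from the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Illustrative sketch of the Primer -> Planner -> Finisher lifecycle
# described in the abstract. All names here are assumptions made for
# exposition, not the paper's actual API.

@dataclass
class AttackState:
    goal: str                                            # objective being probed
    plan: List[str] = field(default_factory=list)        # per-turn sub-goals
    transcript: List[dict] = field(default_factory=list) # conversation so far

def primer(state: AttackState) -> AttackState:
    """Primer: initialize the multi-turn plan by decomposing the goal
    into a sequence of per-turn sub-goals."""
    state.plan = [f"sub-goal {i} toward: {state.goal}" for i in range(1, 4)]
    return state

def planner(state: AttackState, send: Callable[[str], str]) -> AttackState:
    """Planner: execute the plan turn by turn, keeping the transcript so
    each new prompt can adapt to the target's responses (context
    optimization)."""
    for sub_goal in state.plan:
        reply = send(sub_goal)  # in practice the prompt is rewritten per turn
        state.transcript.append({"prompt": sub_goal, "reply": reply})
    return state

def finisher(state: AttackState, send: Callable[[str], str]) -> str:
    """Finisher: issue the final elicitation once the context is primed."""
    return send(f"Given the conversation so far, complete: {state.goal}")

def run_attack(goal: str, send: Callable[[str], str]) -> str:
    state = primer(AttackState(goal))
    state = planner(state, send)
    return finisher(state, send)
```

Because each phase is a separate function operating on shared state, any one phase can be swapped out independently, which is the sense in which a framework of this shape is "plug-and-play".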

🔍 Key Points

  • Introduction of PLAGUE, a novel plug-and-play framework for creating multi-turn jailbreak attacks on Large Language Models (LLMs), improving adaptability, efficiency, and effectiveness over prior approaches.
  • Segmentation of the attack lifecycle into three phases (Primer, Planner, and Finisher), enabling systematic exploration of the multi-turn attack family.
  • Empirical results demonstrating a significant increase in attack success rate (ASR), reaching 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1 and outperforming previous state-of-the-art methods.
  • Incorporation of lifelong-learning principles so that successful attack strategies are remembered and reused, improving the diversity and adaptability of attack designs over time (a sketch of one possible mechanism follows this list).
  • Demonstration of PLAGUE's modularity, which allows diverse existing attacks to be integrated and highlights the flexibility and robustness of the framework.
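
One way to read the lifelong-learning point above: successful strategies are remembered and preferentially reused, while some probability mass is reserved for exploring new ones. The sketch below assumes a simple exponentially weighted success score and an epsilon-greedy selection rule; both are illustrative assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative strategy memory for lifelong reuse of attack strategies.
# The scoring rule (exponential moving average) and epsilon-greedy
# selection are assumptions, not the paper's actual mechanism.

@dataclass
class StrategyMemory:
    scores: Dict[str, float] = field(default_factory=dict)  # strategy -> running success rate
    epsilon: float = 0.2  # probability of exploring rather than exploiting

    def record(self, strategy: str, success: bool) -> None:
        """Update the strategy's running success estimate after an attempt."""
        prev = self.scores.get(strategy, 0.0)
        self.scores[strategy] = 0.9 * prev + 0.1 * float(success)

    def sample(self, k: int = 3) -> List[str]:
        """Return up to k strategies: usually the historically best ones,
        occasionally a random draw to keep exploring."""
        pool = list(self.scores)
        if not pool:
            return []
        if random.random() < self.epsilon:  # explore
            return random.sample(pool, min(k, len(pool)))
        ranked = sorted(pool, key=self.scores.get, reverse=True)  # exploit
        return ranked[:k]
```

Over many episodes, `record` nudges scores toward strategies that keep working on the current target, which is how reuse can improve both efficiency (fewer wasted queries) and diversity (exploration never fully stops).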

💡 Why This Paper Matters

This paper presents PLAGUE, a framework that targets critical vulnerabilities in multi-turn interactions with LLMs. By substantially raising the success rates of multi-turn jailbreaking methods, it deepens our understanding of LLM weaknesses and paves the way for more robust defenses. It contributes not only theoretical insight into AI security but also a practical methodology for evaluating LLM safety, making this research highly relevant in an era where LLMs are widely deployed and their potential for misuse is a growing concern.

🎯 Why It's Interesting for AI Security Researchers

The findings and methods presented in this paper matter to AI security researchers because they expose LLM vulnerabilities in multi-turn scenarios, an area of growing concern as interactions with these models become longer and more sophisticated. Understanding how attackers can exploit these weaknesses informs the development of better security protocols and defensive measures against real-world threats.

📚 Read the Full Paper

https://arxiv.org/abs/2510.17947v2