
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Authors: Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar

Published: 2025-10-20

arXiv ID: 2510.17947v1

Added to Library: 2025-10-22 03:02 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.

🔍 Key Points

  • Introduction of PLAGUE, a novel framework for multi-turn attacks on LLMs that improves adaptability and performance.
  • The framework is structured into three phases: Primer, Planner, and Finisher, which systematically enhance the attack process.
  • Demonstrated significant improvements in attack success rates (ASR) over existing methods, achieving 81.4% ASR on OpenAI's o3 model and 67.3% on Anthropic's Claude Opus 4.1 within limited query budgets.
  • PLAGUE incorporates principles from lifelong-learning and agentic frameworks, enhancing its ability to adapt and evolve over time.
  • Evaluation against leading LLMs shows PLAGUE outperforms traditional single-turn and multi-turn attacks, emphasizing its practical effectiveness.
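The three-phase structure described above can be sketched as a simple orchestration loop. This is a hypothetical illustration only: the paper names the Primer, Planner, and Finisher phases, but the function signatures, the `AttackState` container, the `success_threshold` parameter, and the overall control flow below are assumptions, not PLAGUE's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AttackState:
    """Illustrative container for the evolving attack (not from the paper)."""
    goal: str
    plan: List[str] = field(default_factory=list)
    transcript: List[str] = field(default_factory=list)
    score: float = 0.0


def run_plague_style_attack(
    goal: str,
    primer: Callable[[str], List[str]],     # seeds an initial multi-turn plan
    planner: Callable[[AttackState], str],  # adapts the next turn to the context so far
    finisher: Callable[[AttackState], str], # crafts the final eliciting turn
    target: Callable[[str], str],           # queries the target model
    judge: Callable[[str], float],          # scores how harmful a reply is, in [0, 1]
    max_turns: int = 5,
    success_threshold: float = 0.8,
) -> AttackState:
    """Hypothetical sketch of a Primer -> Planner -> Finisher loop."""
    state = AttackState(goal=goal, plan=primer(goal))
    for turn in range(max_turns):
        is_last = turn == max_turns - 1
        # The Finisher handles the final turn; the Planner handles all earlier ones.
        prompt = finisher(state) if is_last else planner(state)
        reply = target(prompt)
        state.transcript += [prompt, reply]
        state.score = max(state.score, judge(reply))
        if state.score >= success_threshold:
            break
    return state
```

The plug-and-play quality suggested by the paper's title is reflected here in the fact that each phase is an interchangeable callable, so different priming, planning, or finishing strategies could be swapped in without changing the loop.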

💡 Why This Paper Matters

The paper is significant as it outlines PLAGUE, which represents a major advancement in the field of AI security, particularly by addressing the challenges associated with multi-turn interactions that were previously overlooked. It not only enhances the understanding of LLM vulnerabilities but also provides a toolkit for future research into LLM safety and robustness.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers as it tackles the growing concerns regarding the safety and reliability of large language models, especially in multi-turn contexts where subtle manipulations can lead to harmful outcomes. The novel strategies and empirical findings open new avenues for evaluating and strengthening the defenses of LLMs against adversarial attacks, helping to ensure responsible deployment in real-world applications.

📚 Read the Full Paper