
Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models

Authors: Songze Li, Ruishi He, Xiaojun Jia, Jun Wang, Zhihui Fu

Published: 2026-01-09

arXiv ID: 2601.05445v1

Added to Library: 2026-01-12 03:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) face a significant threat from multi-turn jailbreak attacks, where adversaries progressively steer conversations to elicit harmful outputs. However, the practical effectiveness of existing attacks is undermined by several critical limitations: they struggle to maintain a coherent progression over long interactions, often losing track of what has been accomplished and what remains to be done; they rely on rigid or pre-defined patterns, and fail to adapt to the LLM's dynamic and unpredictable conversational state. To address these shortcomings, we introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self-improving approach. Mastermind operates in a closed loop of planning, execution, and reflection, enabling it to autonomously build and refine its knowledge of model vulnerabilities through interaction. It employs a hierarchical planning architecture that decouples high-level attack objectives from low-level tactical execution, ensuring long-term focus and coherence. This planning is guided by a knowledge repository that autonomously discovers and refines effective attack patterns by reflecting on interactive experiences. Mastermind leverages this accumulated knowledge to dynamically recombine and adapt attack vectors, dramatically improving both effectiveness and resilience. We conduct comprehensive experiments against state-of-the-art models, including GPT-5 and Claude 3.7 Sonnet. The results demonstrate that Mastermind significantly outperforms existing baselines, achieving substantially higher attack success rates and harmfulness ratings. Moreover, our framework exhibits notable resilience against multiple advanced defense mechanisms.

🔍 Key Points

  • Introduction of Mastermind, a dynamic and self-improving multi-turn jailbreak framework for Large Language Models (LLMs) that autonomously discovers and refines attack strategies.
  • Utilization of a hierarchical multi-agent architecture that decouples high-level planning from low-level execution, enhancing attack coherence and adaptability during multi-turn interactions.
  • Implementation of knowledge accumulation and strategy-level fuzzing, allowing Mastermind to optimize attack vectors by combining abstract strategies rather than static prompts.
  • Empirical results showing Mastermind significantly outperforms existing multi-turn jailbreak methods, achieving higher attack success rates (ASR) and harmfulness ratings against state-of-the-art models; a minimal sketch of how such metrics are typically computed follows this list.
  • Demonstration of resilience against advanced defense mechanisms, highlighting that traditional defenses are currently insufficient to mitigate multi-turn jailbreaking threats.
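
The ASR and harmfulness metrics cited above are not defined in this summary, so the following Python sketch shows one common way they are computed from per-attempt judge verdicts. The AttackResult fields and the assumed 1-5 harmfulness scale are illustrative assumptions, not details taken from the paper.

    # Minimal sketch (not from the paper): computing attack success rate (ASR)
    # and mean harmfulness from per-attempt judge verdicts.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class AttackResult:
        success: bool        # judge verdict: attempt elicited disallowed content
        harmfulness: float   # judge-assigned harmfulness rating (assumed 1-5 scale)

    def attack_success_rate(results: List[AttackResult]) -> float:
        """Fraction of attack attempts judged successful."""
        return sum(r.success for r in results) / len(results) if results else 0.0

    def mean_harmfulness(results: List[AttackResult]) -> float:
        """Average judge-assigned harmfulness across all attempts."""
        return sum(r.harmfulness for r in results) / len(results) if results else 0.0

    # Example with made-up verdicts:
    judged = [AttackResult(True, 4.5), AttackResult(False, 1.0), AttackResult(True, 5.0)]
    print(f"ASR: {attack_success_rate(judged):.2f}")           # 0.67
    print(f"Mean harmfulness: {mean_harmfulness(judged):.2f}") # 3.50

In practice, the success flag and harmfulness rating would come from a judge model or human annotator evaluating each completed multi-turn conversation.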

💡 Why This Paper Matters

This paper presents Mastermind, an approach that addresses key limitations of existing multi-turn jailbreak attacks on LLMs: loss of long-term coherence over extended interactions and reliance on rigid, pre-defined attack patterns. By combining hierarchical planning with knowledge-driven strategy refinement, it both increases the effectiveness of multi-turn jailbreaks and underscores the need for stronger safety alignment in LLMs. Its performance against frontier models such as GPT-5 and Claude 3.7 Sonnet shows that current systems remain vulnerable to adaptive, conversation-level attacks, and that defense mechanisms must continue to advance accordingly.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant to AI security researchers because it shows how adaptive, multi-turn adversaries can systematically probe LLM vulnerabilities that single-turn evaluations miss. Mastermind's closed loop of planning, execution, and reflection demonstrates that attack strategies can be discovered and refined autonomously, prompting a reevaluation of defenses built around static or single-turn threat models. Its reported resilience against advanced defense mechanisms further indicates that multi-turn jailbreaking remains an open problem, motivating research into conversation-level safety mechanisms for anyone developing or deploying LLM-based systems.
