
Multi-turn Jailbreaking via Global Refinement and Active Fabrication

Authors: Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin

Published: 2025-06-22

arXiv ID: 2506.17881v1

Added to Library: 2025-06-24 04:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have achieved exceptional performance across a wide range of tasks. However, they still pose significant safety risks due to their potential misuse for malicious purposes. Jailbreaks, which aim to elicit harmful content from models, play a critical role in identifying underlying security threats. Recent jailbreaking research has primarily focused on single-turn scenarios, while the more complex multi-turn scenarios remain underexplored. Moreover, existing multi-turn jailbreaking techniques struggle to adapt to the evolving dynamics of a dialogue as the interaction progresses. To address this limitation, we propose a novel multi-turn jailbreaking method that refines the jailbreaking path globally at each interaction. We also actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent questions. Experimental results demonstrate the superior performance of our method compared with existing single-turn and multi-turn jailbreaking techniques across six state-of-the-art LLMs. Our code is publicly available at https://github.com/Ytang520/Multi-Turn_jailbreaking_Global-Refinment_and_Active-Fabrication.

🔍 Key Points

  • Introduction of a novel multi-turn jailbreaking method that refines prompts globally at each interaction, enhancing the effectiveness of eliciting harmful responses from large language models (LLMs).
  • Demonstrated superior performance of the proposed method compared to existing single-turn and multi-turn jailbreaking approaches across six state-of-the-art LLMs, achieving high attack success rates.
  • Active fabrication strategies are employed to manipulate model responses, removing safety warnings and revising rejected queries to maintain a path towards successful jailbreaks.
  • The effectiveness of the method is further validated through extensive experiments, including experimental setups, main results, and ablation studies, confirming the contributions of global refinement and active fabrication to the attack's success.
  • The research identifies vulnerabilities in current multi-turn defense mechanisms against jailbreaking, emphasizing the challenges LLMs face in recognizing and mitigating multi-turn attacks.

💡 Why This Paper Matters

This paper presents a critical advancement in understanding the vulnerability of large language models to sophisticated multi-turn jailbreaking attacks. By introducing strategies for global prompt refinement and response fabrication, it sheds light on the risks that emerge when LLMs are subjected to adversarial multi-turn interactions. The findings show that existing defenses are inadequate against such carefully crafted attacks, making this work a significant contribution to the field of AI safety and security.

🎯 Why It's Interesting for AI Security Researchers

The insights and methodologies outlined in this paper are crucial for AI security researchers working to improve the robustness of language models against adversarial threats. The sophistication of the proposed jailbreaking method raises important questions about the resilience of LLMs to manipulated dialogues, a growing concern as AI systems are deployed in sensitive applications. Researchers aiming to develop more effective defensive strategies, or to understand emerging vulnerabilities in AI security frameworks, will find the detailed analysis and methodologies presented here highly valuable.

📚 Read the Full Paper