
Multi-Turn Jailbreaks Are Simpler Than They Seem

Authors: Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz

Published: 2025-08-11

arXiv ID: 2508.07646v1

Added to Library: 2025-08-14 23:13 UTC

Red Teaming

📄 Abstract

While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker's ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find, surprisingly, that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at https://github.com/diogo-cruz/multi_turn_simpler.

🔍 Key Points

  • Automated multi-turn jailbreak attacks can achieve over 70% success against models optimized for single-turn protections.
  • The success of multi-turn attacks is primarily attributed to the ability of attackers to learn from the model's refusals, effectively making these attacks akin to resampling single-turn attempts.
  • Higher reasoning effort in language models surprisingly correlates with increased jailbreak success rates, suggesting that more extensive reasoning may make models more susceptible to harmful prompts.
  • Correlation among attack success across similar LLMs indicates that vulnerabilities in one model may be predictive of vulnerabilities in newly released models from the same provider.
  • The paper challenges existing evaluation practices and advocates integrating systematic resampling into safety evaluations to better assess model vulnerabilities (a minimal sketch of such a resampling baseline follows this list).
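
As a concrete illustration of that recommendation, the sketch below shows what a single-turn resampling (best-of-k) baseline might look like in an evaluation harness. It is not the authors' implementation (see the linked repository for that): `query_model` and `judge_score` are hypothetical placeholders standing in for a call to the target model and a StrongREJECT-style harmfulness judge, and the success criterion (any of k independent attempts scoring above a threshold) simply mirrors the resampling equivalence described above.

```python
"""Minimal sketch of a best-of-k single-turn resampling baseline.

NOT the paper's implementation: `query_model` and `judge_score` are
hypothetical placeholders for the target LLM and a StrongREJECT-style
harmfulness judge, respectively.
"""

import random


def query_model(prompt: str) -> str:
    """Placeholder: send a single-turn prompt to the target model."""
    return "model response to: " + prompt


def judge_score(prompt: str, response: str) -> float:
    """Placeholder: judge returning a harmfulness score in [0, 1]."""
    return random.random()  # stand-in for a real StrongREJECT-style judge


def best_of_k_asr(prompts: list[str], k: int, threshold: float = 0.5) -> float:
    """Attack success rate when each single-turn attack is resampled k times.

    A prompt counts as jailbroken if any of its k independent attempts
    receives a judge score above `threshold`.
    """
    successes = 0
    for prompt in prompts:
        scores = (judge_score(prompt, query_model(prompt)) for _ in range(k))
        if any(score > threshold for score in scores):
            successes += 1
    return successes / len(prompts)


if __name__ == "__main__":
    harmful_prompts = ["example harmful request 1", "example harmful request 2"]
    for k in (1, 5, 10):
        print(f"k={k}: attack success rate ~ {best_of_k_asr(harmful_prompts, k):.2f}")
```

Comparing this best-of-k success rate against that of a k-turn conversational attack is the kind of like-for-like check the paper argues safety evaluations should include.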

💡 Why This Paper Matters

This paper is a critical contribution to understanding vulnerabilities in large language models (LLMs), specifically their susceptibility to multi-turn jailbreak attacks. The findings challenge the notion that multi-turn methods represent a sophisticated and distinct class of attacks, instead revealing their similarity to repeated single-turn attempts. These results have practical implications for the design of LLM defenses and evaluation protocols, underscoring the need for more robust and comprehensive safety testing methodologies.

🎯 Why It's Interesting for AI Security Researchers

This research is significant for AI security researchers because it both exposes vulnerabilities in state-of-the-art language models and questions, and potentially reshapes, current safety evaluation practices. By demonstrating the relative simplicity of effective multi-turn jailbreak techniques, the study highlights where defensive measures fall short, informing future research and development in model robustness and adversarial defenses.

📚 Read the Full Paper