AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban, Kevin Zhu

Published: 2025-11-04

arXiv ID: 2511.02376v2

Added to Library: 2025-11-11 03:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) remain vulnerable to jailbreaking attacks in which adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to a 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that first disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

🔍 Key Points

  • Introduction of AutoAdv, a novel, training-free framework for automated multi-turn jailbreaking of large language models (LLMs) that achieves up to a 95% attack success rate on Llama-3.1-8B within six turns.
  • Combination of three adaptive mechanisms: 1) a pattern manager that learns from successful attacks, 2) a temperature manager that adjusts sampling parameters based on failure modes, and 3) a two-phase rewriting strategy that disguises and then iteratively refines harmful requests.
  • Extensive evaluations revealing persistent vulnerabilities in LLMs' safety mechanisms, particularly highlighting the inadequacy of single-turn defenses against multi-turn adaptive attacks.
  • Demonstration of a significant performance improvement (24%) of multi-turn attacks over single-turn methods, underscoring the need for multi-turn-aware defenses in LLMs.
  • Identification of varying vulnerabilities across different LLMs, indicating that security strategies need to be model-specific.

💡 Why This Paper Matters

This paper presents critical findings on the vulnerabilities of LLMs to adversarial attacks, especially in multi-turn contexts that closely resemble real-world conditions. The introduction of AutoAdv not only provides a robust framework for identifying these weaknesses but also offers insights that could inform the design of future defenses to improve model safety and reliability.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodologies presented in this paper are of considerable interest to AI security researchers because they expose the limitations of current safety measures in LLMs. The work emphasizes the need for adaptive, multi-turn-aware defenses and contributes significantly to the discourse on LLM robustness against evolving attack strategies. The framework introduced could inform improved security protocols that support the safe deployment of AI systems.

📚 Read the Full Paper