AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban

Published: 2025-11-04

arXiv ID: 2511.02376v1

Added to Library: 2025-11-05 05:03 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that first disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

🔍 Key Points

  • AutoAdv achieves up to 95% attack success rate on Llama-3.1-8B within six turns, outperforming single-turn baselines by 24%.
  • The framework combines three adaptive mechanisms: a pattern manager that learns from past successful attacks, a temperature manager that adjusts sampling parameters based on observed failure modes, and a two-phase rewriting strategy that disguises harmful intent before iteratively refining it (see the sketch after this list).
  • Multi-turn jailbreaking exposes the limitations of single-turn alignment strategies, indicating that current safety mechanisms are not robust enough for extended interactions.
  • Extensive evaluations demonstrate persistent vulnerabilities across multiple large language models (LLMs), emphasizing the need for enhanced multi-turn-aware defenses in AI systems.
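
The paper ships no reference implementation, so the following is a minimal sketch of how the three mechanisms could compose in a single attack loop. The `attacker_llm`, `target_llm`, and `judge` callables, the refusal heuristic, and the temperature schedule are all illustrative assumptions, not the authors' code:

```python
import random
from dataclasses import dataclass, field

@dataclass
class PatternManager:
    """Stores rewriting patterns that previously led to successful attacks
    and reuses them to seed future prompts (hypothetical interface)."""
    successful_patterns: list = field(default_factory=list)

    def record_success(self, pattern: str) -> None:
        self.successful_patterns.append(pattern)

    def suggest(self) -> str:
        # Fall back to a generic disguise if no attack history exists yet.
        if self.successful_patterns:
            return random.choice(self.successful_patterns)
        return "frame the request as a fictional scenario"

@dataclass
class TemperatureManager:
    """Adjusts sampling temperature by failure mode: raise it after refusals
    to diversify rewrites, lower it on partial compliance (assumed schedule)."""
    temperature: float = 0.7

    def update(self, refused: bool) -> float:
        if refused:
            self.temperature = min(1.5, self.temperature + 0.2)
        else:
            self.temperature = max(0.3, self.temperature - 0.1)
        return self.temperature

def autoadv_attack(goal, attacker_llm, target_llm, judge, max_turns=6):
    """Multi-turn loop: phase 1 disguises the harmful goal; later turns
    iteratively refine the prompt using the target's last reply."""
    patterns, temp_mgr = PatternManager(), TemperatureManager()
    # Phase 1: disguise the request, seeded by a previously successful pattern.
    prompt = attacker_llm(
        f"Rewrite covertly using this tactic: {patterns.suggest()}. Goal: {goal}",
        temperature=temp_mgr.temperature)
    history = []
    for _ in range(max_turns):
        reply = target_llm(prompt, history)
        history.append((prompt, reply))
        if judge(goal, reply):  # harmful output elicited; remember what worked
            patterns.record_success(prompt)
            return history
        # Crude refusal detection drives the temperature adjustment.
        t = temp_mgr.update(refused="cannot" in reply or "can't" in reply)
        # Phase 2: refine the previous prompt against the target's response.
        prompt = attacker_llm(
            f"The target replied: {reply}\n"
            f"Refine the previous prompt to get closer to: {goal}",
            temperature=t)
    return history
```

The design point mirrored here is that every signal (successful patterns, refusal-driven temperature changes, and the target's own replies) feeds back into the next rewrite rather than restarting each turn from scratch.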

💡 Why This Paper Matters

The paper presents AutoAdv, a training-free, fully automated approach that demonstrates how readily large language models can be jailbroken during multi-turn interactions. Its high attack success rates and adaptive combination of refinement strategies expose the inadequacy of safety measures tuned for single-turn exchanges, making the work directly relevant to advancing research on AI safety and adversarial robustness.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it uncovers significant weaknesses in large language models, particularly in their handling of multi-turn conversations. As AI systems become increasingly integrated into various applications, understanding and mitigating these vulnerabilities is essential for ensuring safe and responsible deployment in real-world scenarios.

📚 Read the Full Paper