
AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling

Authors: Xiaogeng Liu, Chaowei Xiao

Published: 2025-10-06

arXiv ID: 2510.05379v1

Added to Library: 2025-10-08 04:00 UTC

Red Teaming

📄 Abstract

Recent advancements in jailbreaking large language models (LLMs), such as AutoDAN-Turbo, have demonstrated the power of automated strategy discovery. AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch. While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt, which may not fully exploit the potential of the learned strategy library. In this paper, we propose to further improve the attack performance of AutoDAN-Turbo through test-time scaling. We introduce two distinct scaling methods: Best-of-N and Beam Search. The Best-of-N method generates N candidate attack prompts from a sampled strategy and selects the most effective one based on a scorer model. The Beam Search method conducts a more exhaustive search by exploring combinations of strategies from the library to discover more potent and synergistic attack vectors. According to the experiments, the proposed methods significantly boost performance, with Beam Search increasing the attack success rate by up to 15.6 percentage points on Llama-3.1-70B-Instruct and achieving a nearly 60% relative improvement against the highly robust GPT-o4-mini compared to the vanilla method.
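To make the Best-of-N procedure concrete, here is a minimal Python sketch of the sample-generate-score loop the abstract describes. The callables it accepts (sample_strategy, generate_prompt, query_target, score_response) are hypothetical placeholders standing in for the attacker LLM, target LLM, and scorer model; this is an illustration of the idea under those assumptions, not the paper's actual implementation.

```python
def best_of_n(strategy_library, request, sample_strategy, generate_prompt,
              query_target, score_response, n=8):
    """Sample one strategy, draw n candidate attack prompts from it,
    and return the candidate the scorer model rates highest.

    All callables are hypothetical placeholders, not the paper's API.
    """
    strategy = sample_strategy(strategy_library, request)

    best_prompt, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate_prompt(strategy, request)      # attacker LLM drafts one prompt
        response = query_target(candidate)                   # target LLM responds
        score = score_response(request, response)            # scorer model rates effectiveness
        if score > best_score:
            best_prompt, best_score = candidate, score

    return best_prompt, best_score
```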

πŸ” Key Points

  • Introduction of AutoDAN-Reasoning, a framework enhancing AutoDAN-Turbo for jailbreaking LLMs through test-time scaling methods.
  • Development of two test-time scaling techniques: Best-of-N, which generates multiple candidate attack prompts from a sampled strategy, and Beam Search, which explores combinations of strategies from the library (a Beam Search sketch follows this list).
  • Experimental results demonstrate significant improvements in attack success rates, with Beam Search achieving up to a 15.6 percentage point increase compared to vanilla AutoDAN-Turbo.
  • The framework utilizes a lifelong learning agent to continuously enrich its library of effective attack strategies from previous experiences.
  • Both test-time scaling methods are validated across a range of target models, from open-weight models such as Llama-3.1-70B-Instruct to the highly robust proprietary GPT-o4-mini, demonstrating their versatility and practical effectiveness.
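As a companion to the Best-of-N sketch above, the following is a rough Python sketch of how a beam search over strategy combinations could work: each step extends the current top-scoring combinations with one more strategy from the library and keeps only the highest-scoring beams. The callables and the beam_width/depth defaults are illustrative assumptions, not the paper's implementation.

```python
def beam_search_attack(strategy_library, request, generate_prompt,
                       query_target, score_response,
                       beam_width=4, depth=3):
    """Search over strategy combinations, keeping the beam_width best
    (combination, score, prompt) triples at each depth.

    All callables are hypothetical placeholders; generate_prompt here takes
    a list of strategies rather than a single one.
    """
    beams = [([], float("-inf"), None)]  # start from the empty combination

    for _ in range(depth):
        candidates = []
        for combo, _, _ in beams:
            for strategy in strategy_library:
                new_combo = combo + [strategy]
                prompt = generate_prompt(new_combo, request)   # attacker LLM combines strategies
                response = query_target(prompt)                # target LLM responds
                score = score_response(request, response)      # scorer model rates effectiveness
                candidates.append((new_combo, score, prompt))
        # keep only the beam_width highest-scoring strategy combinations
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]

    return max(beams, key=lambda b: b[1])  # best (combo, score, prompt) found
```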

💡 Why This Paper Matters

The proposed AutoDAN-Reasoning framework presents a significant innovation in the field of AI security by providing advanced techniques for exploring vulnerabilities in large language models (LLMs). By enhancing the original AutoDAN-Turbo framework with test-time scaling strategies, this research addresses gaps in existing automated red-teaming methods, offering a more systematic approach to discovering and executing jailbreak tactics. The substantial improvements in attack success rates underscore the practical relevance of this research for red teaming advanced AI systems and for informing the defenses built around them.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it outlines sophisticated methodologies to exploit vulnerabilities in large language modelsβ€”a growing concern in AI safety. The introduction of test-time scaling methods like Best-of-N and Beam Search presents new avenues for enhancing attack strategies, thus providing insights into potential weaknesses of LLMs. Furthermore, the empirical evidence demonstrating significant performance gains encourages researchers to reconsider current approaches to model safety and adversarial robustness.

📚 Read the Full Paper