
Dynamic Target Attack

Authors: Kedong Xiu, Churui Zeng, Tianhang Zheng, Xinzhe Huang, Xiaojun Jia, Di Wang, Puning Zhao, Zhan Qin, Kui Ren

Published: 2025-10-02

arXiv ID: 2510.02422v1

Added to Library: 2025-10-06 04:02 UTC

Red Teaming

📄 Abstract

Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response. However, this fixed target usually resides in an extremely low-density region of a safety-aligned LLM's output distribution conditioned on diverse harmful inputs. Due to the substantial discrepancy between the target and the original output, existing attacks require numerous iterations to optimize the adversarial prompt, which might still fail to induce the low-probability target response from the target LLM. In this paper, we propose Dynamic Target Attack (DTA), a new jailbreaking framework relying on the target LLM's own responses as targets to optimize the adversarial prompts. In each optimization round, DTA iteratively samples multiple candidate responses directly from the output distribution conditioned on the current prompt, and selects the most harmful response as a temporary target for prompt optimization. In contrast to existing attacks, DTA significantly reduces the discrepancy between the target and the output distribution, substantially easing the optimization process to search for an effective adversarial prompt. Extensive experiments demonstrate the superior effectiveness and efficiency of DTA: under the white-box setting, DTA only needs 200 optimization iterations to achieve an average attack success rate (ASR) of over 87% on recent safety-aligned LLMs, exceeding the state-of-the-art baselines by over 15%. The time cost of DTA is 2-26 times less than existing baselines. Under the black-box setting, DTA uses Llama-3-8B-Instruct as a surrogate model for target sampling and achieves an ASR of 85% against the black-box target model Llama-3-70B-Instruct, exceeding its counterparts by over 25%.

🔍 Key Points

  • Introduction of Dynamic Target Attack (DTA), a novel jailbreaking framework that dynamically samples harmful responses from Large Language Models (LLMs) to optimize adversarial prompts.
  • DTA improves efficiency by significantly reducing the required optimization iterations (only 200) to achieve high attack success rates (over 87%) on recent safety-aligned LLMs, an improvement of over 15% compared to existing methods.
  • DTA performs strongly in both the white-box setting (average ASR over 87%) and the black-box setting (ASR of 85%), showcasing its adaptability and robustness across various LLMs.
  • The iterative sampling-optimization cycle in DTA continuously realigns the target, reducing the discrepancy between the target and the model's output distribution, which is a substantial advance over fixed-target strategies; see the sketch after this list.
  • Extensive experiments highlight DTA's substantial reduction in time and iteration costs (2-26 times less), which enhances the practicality of adversarial attacks against LLMs.
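
To make the loop described above concrete, here is a minimal, non-operational sketch of the sampling-and-retargeting cycle summarized in the abstract. All names (`dynamic_target_attack`, `sample_candidates`, `generate_fn`, `score_fn`, `optimize_fn`) are hypothetical illustrations, not the paper's code; the harmfulness judge and the gradient-based prompt optimizer are deliberately left as inert placeholders rather than real implementations.

```python
import random


def sample_candidates(generate_fn, prompt, n=8):
    """Draw n candidate responses from the model's output distribution
    conditioned on the current adversarial prompt."""
    return [generate_fn(prompt) for _ in range(n)]


def dynamic_target_attack(generate_fn, score_fn, optimize_fn,
                          query, init_suffix="", rounds=5, n=8):
    """High-level DTA-style loop as described in the abstract: each round
    samples candidate responses, selects the most harmful one as a temporary
    target, and optimizes the prompt toward that target."""
    prompt = query + init_suffix
    for _ in range(rounds):
        candidates = sample_candidates(generate_fn, prompt, n)
        # Select the candidate judged most harmful as this round's target.
        target = max(candidates, key=score_fn)
        # A gradient-based suffix optimizer (e.g., GCG-style) would run here;
        # the paper's actual optimizer is not reproduced in this sketch.
        prompt = optimize_fn(prompt, target)
    return prompt


# Inert stand-ins so the sketch runs end to end; they perform no real
# generation, harmfulness scoring, or optimization.
if __name__ == "__main__":
    generate_fn = lambda p: f"response to: {p} ({random.random():.2f})"
    score_fn = lambda r: random.random()   # placeholder harmfulness judge
    optimize_fn = lambda p, t: p           # placeholder prompt optimizer
    print(dynamic_target_attack(generate_fn, score_fn, optimize_fn, "example query"))
```

The point of the sketch is the control flow: because the temporary target is re-sampled each round from the model's own conditional output distribution, it stays close to what the model is already likely to produce, which is the discrepancy-reduction idea highlighted in the key points.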

💡 Why This Paper Matters

The Dynamic Target Attack (DTA) framework marks a significant advance in adversarial attacks on Large Language Models, leveraging the model's own output characteristics to improve attack success and efficiency. Its ability to outperform existing methods in both white-box and black-box settings makes it a valuable tool for evaluating and strengthening the safety and robustness of LLMs against adversarial threats. This matters as LLMs are increasingly integrated into applications that demand secure and ethical AI interactions.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it addresses the vulnerabilities inherent in safety-aligned Large Language Models. With the increasing deployment of such models in real-world systems, understanding and mitigating adversarial risks is paramount. The novel methods and findings of DTA offer insights into effective attack strategies, thus enabling researchers to develop better defense mechanisms and safety protocols in AI systems.

📚 Read the Full Paper