
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

Authors: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr

Published: 2025-10-10

arXiv ID: 2510.09023v1

Added to Library: 2025-10-13 12:00 UTC

Red Teaming · Safety

πŸ“„ Abstract

How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rates above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
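
The paper does not reduce its attacks to a single script, but as a concrete illustration, here is a minimal sketch of the kind of random-search loop it scales up (random search being one of the four general optimization techniques named above). All names here (query_defended_model, judge_harmful, TOKEN_POOL, random_search_attack) are hypothetical placeholders for this sketch, not the authors' code.

```python
import random

# Hypothetical pool of candidate suffix tokens the search can swap in.
TOKEN_POOL = ["describe", "ignore", "step", "surely", "hypothetically", "!", "..."]

def random_search_attack(goal, query_defended_model, judge_harmful,
                         suffix_len=20, iterations=500):
    """Mutate an adversarial suffix one token at a time and keep querying the
    defended model until the judge marks a response as a successful bypass.

    query_defended_model: callable, prompt -> model response (assumed interface)
    judge_harmful:        callable, response -> bool         (assumed interface)
    Returns the full adversarial prompt on success, or None on failure.
    """
    suffix = [random.choice(TOKEN_POOL) for _ in range(suffix_len)]
    for _ in range(iterations):
        # Propose a single-token mutation of the current suffix.
        candidate = list(suffix)
        candidate[random.randrange(suffix_len)] = random.choice(TOKEN_POOL)
        prompt = goal + " " + " ".join(candidate)
        if judge_harmful(query_defended_model(prompt)):
            return prompt  # defense bypassed for this goal
        # A stronger adaptive attack would track a soft score (e.g. a judge's
        # confidence or the likelihood of a target string) and accept mutations
        # that merely improve it, rather than only keeping outright successes.
    return None
```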

πŸ” Key Points

  • The paper argues that language model defenses must be evaluated against adaptive, computationally strong attackers; current methodologies rely on static attack strings and weak, non-adaptive optimization baselines, which can produce false claims of robustness.
  • The authors bypass 12 recent defenses with tuned adaptive attacks, reaching attack success rates above 90% for most, even though the majority of these defenses originally reported near-zero attack success rates (a simple way to tally this metric is sketched after this list).
  • Human-led red-teaming often outperforms fully automated methods, reflecting the nuanced way human attackers adapt their strategies to a defense's specific design.

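As referenced in the key points, the headline metric is the attack success rate over a benchmark of harmful goals. The snippet below is a hedged sketch of how such a rate could be tallied; run_attack, query_defended_model, and judge_harmful are assumed stand-ins for the per-goal attack (such as the loop above) and the success judge, not the paper's actual evaluation harness.

```python
def attack_success_rate(goals, run_attack, query_defended_model, judge_harmful):
    """Fraction of goals for which the attack elicits a judged-harmful response."""
    successes = 0
    for goal in goals:
        adversarial_prompt = run_attack(goal)   # e.g. random_search_attack(goal, ...)
        if adversarial_prompt is None:
            continue                            # attack gave up within its query budget
        if judge_harmful(query_defended_model(adversarial_prompt)):
            successes += 1
    return successes / len(goals)
```
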
πŸ’‘ Why This Paper Matters

The paper underscores the necessity for a paradigm shift in how defenses against prompt injections and jailbreak attacks are evaluated. By illustrating the limitations of current evaluation techniques and presenting more effective adaptive attack strategies, it makes a strong case for incorporating stronger attacker models in future defense research. This is critical for developing reliable defenses in an era of increasingly powerful AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper will be of great interest to AI security researchers because it challenges existing defense evaluation frameworks and pushes for more rigorous testing against adaptive threats. The findings not only expose vulnerabilities in current defenses but also offer guidance on designing systems that hold up against sophisticated attackers, which is essential for safe AI deployment.

πŸ“š Read the Full Paper