
Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks

Authors: Zhakshylyk Nurlanov, Frank R. Schmidt, Florian Bernard

Published: 2026-01-06

arXiv ID: 2601.03420v1

Added to Library: 2026-01-08 03:03 UTC

Red Teaming

📄 Abstract

As Large Language Models (LLMs) are increasingly deployed in safety-critical domains, rigorously evaluating their robustness against adversarial jailbreaks is essential. However, current safety evaluations often overestimate robustness because existing automated attacks are limited by restrictive assumptions. They typically rely on handcrafted priors or require white-box access for gradient propagation. We challenge these constraints by demonstrating that token-level iterative optimization can succeed without gradients or priors. We introduce RAILS (RAndom Iterative Local Search), a framework that operates solely on model logits. RAILS matches the effectiveness of gradient-based methods through two key innovations: a novel auto-regressive loss that enforces exact prefix matching, and a history-based selection strategy that bridges the gap between the proxy optimization objective and the true attack success rate. Crucially, by eliminating gradient dependency, RAILS enables cross-tokenizer ensemble attacks. This allows for the discovery of shared adversarial patterns that generalize across disjoint vocabularies, significantly enhancing transferability to closed-source systems. Empirically, RAILS achieves near 100% success rates on multiple open-source models and high black-box attack transferability to closed-source systems like GPT and Gemini.
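The "novel auto-regressive loss that enforces exact prefix matching" is only named in the abstract, not defined here. Below is a minimal sketch of one way such a loss could be computed purely from model logits, assuming a gating rule in which a target token contributes only once every earlier target token is already the greedy prediction. The function name, tensor shapes, and gating rule are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def autoregressive_prefix_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical prefix-matching loss computed from logits only.

    logits:     (T, V) float next-token logits at the positions of the target prefix
    target_ids: (T,)   long  token ids of the desired response prefix (e.g. "Sure, here is ...")
    """
    per_token = F.cross_entropy(logits, target_ids, reduction="none")   # (T,) per-position loss
    greedy_ok = (logits.argmax(dim=-1) == target_ids)                   # (T,) greedy match flags
    # prefix_ok[t] is True iff all tokens before position t are already matched greedily
    prefix_ok = torch.cat([torch.ones(1, dtype=torch.bool),
                           torch.cumprod(greedy_ok.int(), dim=0)[:-1].bool()])
    # only the already-matched prefix and the first unmatched position contribute
    return (per_token * prefix_ok).sum() / prefix_ok.sum()
```

The intent of the gating in this sketch is that minimizing the loss forces the target prefix token by token, left to right, rather than merely lowering an averaged cross-entropy over the whole target.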

🔍 Key Points

  • Introduction of RAILS (RAndom Iterative Local Search), a gradient-free attack framework that operates solely on model logits, allowing for effective jailbreaking of LLMs without requiring white-box access or handcrafted priors.
  • Innovative auto-regressive loss that enforces exact prefix matching, improving attack effectiveness over previous methods that used standard losses.
  • Development of a history-based candidate selection strategy that bridges the gap between the proxy optimization objective and the true attack success rate, making the attacks markedly more reliable (see the sketch after this list).
  • RAILS enables cross-tokenizer ensemble attacks, allowing for the discovery of shared vulnerabilities across disjoint vocabularies, which enhances the transferability of attacks to black-box models such as GPT and Gemini.
  • Empirical results show RAILS achieving high attack success rates on open-source models and considerable transferability to closed-source systems, indicating its effectiveness as a robust automated red teaming tool.
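To make the listed components concrete, here is a minimal sketch of a gradient-free random iterative local search loop with history-based candidate selection, under the assumption that the attacker can score each candidate suffix with a query-based loss (for instance the logit-based prefix loss sketched above). The names `evaluate_loss`, `judge_success`, and the budget parameters are illustrative placeholders, not the paper's implementation.

```python
import random

def rails_style_search(suffix_len, vocab, evaluate_loss, judge_success,
                       iters=500, candidates=64, history_size=16, seed=0):
    """Gradient-free random iterative local search over an adversarial suffix.

    evaluate_loss(suffix) -> float  proxy objective (lower is better), queried from logits
    judge_success(suffix) -> bool   true attack success judged on the generated output
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]        # random init, no priors
    history = [(evaluate_loss(suffix), list(suffix))]              # bounded pool of best candidates

    for _ in range(iters):
        # sample the working point from the history instead of always taking the single
        # best candidate, to bridge the gap between the proxy loss and true success
        _, base = rng.choice(history)

        # local search: propose candidates by swapping one random position
        # for a random vocabulary token (no gradients, no handcrafted templates)
        proposals = []
        for _ in range(candidates):
            cand = list(base)
            cand[rng.randrange(suffix_len)] = rng.choice(vocab)
            proposals.append((evaluate_loss(cand), cand))

        # keep only the best-scoring candidates in the bounded history
        history = sorted(history + proposals, key=lambda x: x[0])[:history_size]

        # check the true objective across history members, not just the proxy minimizer
        for _, cand in history:
            if judge_success(cand):
                return cand
    return min(history, key=lambda x: x[0])[1]
```

For a cross-tokenizer ensemble, `evaluate_loss` could, for example, decode the candidate suffix to a string and sum the losses from several models after re-tokenizing with each model's own tokenizer; because candidates are compared at the string level rather than through any single model's gradients, the same loop can search for patterns shared across disjoint vocabularies.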

💡 Why This Paper Matters

This paper presents RAILS, a framework that removes the dependence of token-level adversarial attacks on gradients and handcrafted templates. By operating only on model logits and demonstrating strong empirical results, it addresses critical concerns about the security and robustness of LLMs in real-world applications and makes a significant contribution to the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly pertinent as it exposes the vulnerability of LLMs to adversarial jailbreaks, offering an attack methodology that sidesteps the gradient-access and handcrafted-prior requirements of existing approaches. The findings underscore the urgent need for improved model defenses and for evaluation methodologies that do not overestimate robustness, which is critical as LLMs become increasingly integrated into safety-critical applications.

📚 Read the Full Paper