Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems

Authors: William Hackett, Lewis Birch, Stefan Trawicki, Neeraj Suri, Peter Garraghan

Published: 2025-04-15

arXiv ID: 2504.11168v3

Added to Library: 2025-11-11 14:20 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) guardrail systems are designed to protect against prompt injection and jailbreak attacks. However, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems via traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft's Azure Prompt Shield and Meta's Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility achieving in some instances up to 100% evasion success. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance ranking computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.

🔍 Key Points

Demonstration of two effective evasion techniques for bypassing Large Language Model (LLM) guardrails: traditional character injection and algorithmic Adversarial Machine Learning (AML) techniques.
Empirical analysis of six major LLM guardrail systems, revealing vulnerabilities and a high success rate (up to 100%) in evading detection with certain methods, like Emoji Smuggling and Bidirectional Text.
Introduction of the concept of word importance transferability, showing how adversaries can leverage white-box models to enhance attack success rates against black-box targets.
Provision of a detailed experimental setup and results that highlight the effectiveness of character injection techniques compared to AML approaches, emphasizing the weaknesses in current LLM defenses.
Identification of critical weaknesses in existing LLM guardrail systems, calling for the development of more robust and resilient mechanisms against adversarial attacks.

💡 Why This Paper Matters

This paper is crucial as it sheds light on the vulnerabilities in LLM guardrails designed to prevent prompt injection and jailbreak attacks. By empirically demonstrating the effectiveness of evasion techniques, the research emphasizes the need for improved security measures in AI systems, which are increasingly deployed in various applications where safety and reliability are paramount.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of significant interest to AI security researchers as it provides empirical evidence of the limitations of current LLM protective systems. The findings highlight the urgency for developing more sophisticated defenses against evolving adversarial tactics, thereby contributing to advancing the field of AI security and promoting safer AI deployment in sensitive environments.

Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper