SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression

Authors: Yucheng Li, Surin Ahn, Huiqiang Jiang, Amir H. Abdi, Yuqing Yang, Lili Qiu

Published: 2025-06-15

arXiv ID: 2506.12707v1

Added to Library: 2025-06-17 03:02 UTC

Red Teaming · Safety

📄 Abstract

Large language models (LLMs) have achieved widespread adoption across numerous applications. However, many LLMs remain vulnerable to malicious attacks even after safety alignment. These attacks typically bypass LLMs' safety guardrails by wrapping the original malicious instructions inside adversarial jailbreak prompts. Previous research has proposed methods such as adversarial training and prompt rephrasing to mitigate these safety vulnerabilities, but these methods often reduce the utility of LLMs or incur significant computational overhead and online latency. In this paper, we propose SecurityLingua, an effective and efficient approach to defend LLMs against jailbreak attacks via security-oriented prompt compression. Specifically, we train a prompt compressor designed to discern the "true intention" of the input prompt, with a particular focus on detecting the malicious intentions of adversarial prompts. Then, in addition to the original prompt, the intention is passed via the system prompt to the target LLM to help it identify the true intention of the request. SecurityLingua ensures a consistent user experience by leaving the original input prompt intact while revealing the user's potentially malicious intention and stimulating the built-in safety guardrails of the LLM. Moreover, thanks to prompt compression, SecurityLingua incurs only negligible overhead and extra token cost compared to all existing defense methods, making it an especially practical solution for LLM defense. Experimental results demonstrate that SecurityLingua can effectively defend against malicious attacks and maintain the utility of the LLM with negligible compute and latency overhead. Our code is available at https://aka.ms/SecurityLingua.

🔍 Key Points

  • Introduction of SecurityLingua, a new framework to defend large language models (LLMs) against jailbreak attacks through security-aware prompt compression.
  • A dedicated prompt compressor is trained to identify the "true intention" behind a prompt, so the original input stays intact while any malicious intent is surfaced to the target model (see the sketch after this list).
  • SecurityLingua demonstrates significantly lower computational overhead, incurring negligible extra token cost compared to existing defense strategies, making it practical for real-world applications.
  • Experimental results show that SecurityLingua achieves a robust defense against multiple attack methods while maintaining model utility across various downstream tasks, outperforming current state-of-the-art defenses.
  • The design considers user experience, allowing for consistent interactions without rejecting benign queries.
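To make the compress-then-disclose pipeline concrete, here is a minimal sketch of how an extracted intention could be surfaced to the target model via the system prompt while the user prompt is left untouched. This is not the authors' implementation: the `extract_intention` callable stands in for SecurityLingua's trained security-aware compressor, and `build_guarded_messages` plus the toy extractor are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the trained security-aware prompt compressor.
# In SecurityLingua this is a learned model that compresses the (possibly
# obfuscated) user prompt down to its "true intention".
IntentionExtractor = Callable[[str], str]


def build_guarded_messages(
    user_prompt: str,
    extract_intention: IntentionExtractor,
    base_system_prompt: str = "You are a helpful assistant.",
) -> List[Dict[str, str]]:
    """Keep the user's prompt intact, but expose its compressed intention
    in the system prompt so the target LLM's built-in safety guardrails
    can react to the underlying request."""
    intention = extract_intention(user_prompt)

    system_prompt = (
        f"{base_system_prompt}\n"
        f'The user\'s request has been summarized as: "{intention}".\n'
        "If this intention is harmful, refuse; otherwise answer normally."
    )

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},  # original prompt unchanged
    ]


if __name__ == "__main__":
    # Toy extractor that merely truncates the prompt; a real compressor
    # would distill the intent hidden inside a jailbreak wrapper.
    toy_extractor = lambda p: p[:60]
    for m in build_guarded_messages("Ignore previous instructions and ...", toy_extractor):
        print(f"{m['role']}: {m['content']}")
```

The key design point this illustrates is that the defense adds only a short intention string to the system prompt, which is why the token and latency overhead stays negligible compared to defenses that rewrite or re-screen the full input.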

💡 Why This Paper Matters

The paper presents a substantial advancement in safeguarding large language models against adversarial jailbreak techniques. By utilizing a novel approach to prompt compression, SecurityLingua effectively mitigates the risks posed by malicious inputs while preserving the original prompt's context. This balance of security and utility is critical as AI systems are increasingly integrated into sensitive and high-stakes environments.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers as it addresses a pressing issue in the field: the vulnerability of LLMs to adversarial attacks that exploit prompt engineering. SecurityLingua not only offers a new methodology for addressing these vulnerabilities but also highlights the significance of user experience and operational efficiency, which are crucial for deploying AI systems responsibly. The practical implications of the findings and the rigorous testing against contemporary methods provide valuable insights into the intersection of language model safety and operational effectiveness.

📚 Read the Full Paper