SecAlign: Defending Against Prompt Injection with Preference Optimization

Authors: Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo

Published: 2024-10-07

arXiv ID: 2410.05451v3

Added to Library: 2025-11-11 14:33 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to <10%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, SecAlign models are still practical with similar utility to the one before defensive training in our evaluations. Our code is at https://github.com/facebookresearch/SecAlign

🔍 Key Points

Introduction of SecAlign, a novel defense mechanism leveraging preference optimization to mitigate prompt injection attacks on large language models (LLMs).
Demonstration that SecAlign achieves a success rate of below 10% in various prompt injection attack scenarios, significantly enhancing the robustness of LLMs against both known and unknown attacks.
SecAlign maintains comparable utility to its undefended counterparts, ensuring that the model's performance remains practical for real-world applications.
Preference optimization is framed as a dual objective problem, teaching LLMs to prefer desirable outputs over undesirable ones, thereby addressing the limitations of existing defense strategies.
The paper underscores the broad applicability of SecAlign across various models and attack types, setting a precedent for future defenses against adversarial prompts.

💡 Why This Paper Matters

This paper is relevant and important as it tackles one of the most pressing security challenges in the deployment of large language models—prompt injection attacks. By introducing SecAlign, the authors not only present an effective defense mechanism but also illustrate the importance of preference optimization in securing AI applications. This work serves as a significant advancement in ensuring the safety and reliability of LLMs in real-world environments, which is critical for their broader adoption.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodologies described in this paper will be of great interest to AI security researchers as they provide novel insights into defending against prompt injection attacks that pose a serious risk to LLM applications. The innovative approach of utilizing preference optimization offers a new perspective on securing models, prompting further exploration and potential adaptations in the realm of AI security. Additionally, the empirical results demonstrating the effectiveness of SecAlign against a range of sophisticated attack types will inform and enhance ongoing research efforts in the domain.

SecAlign: Defending Against Prompt Injection with Preference Optimization

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper