Defenses Against Prompt Attacks Learn Surface Heuristics

📄 Abstract

Large language models (LLMs) are increasingly deployed in security-sensitive applications, where they must follow system- or developer-specified instructions that define the intended task behavior, while completing benign user requests. When adversarial instructions appear in user queries or externally retrieved content, models may override intended logic. Recent defenses rely on supervised fine-tuning with benign and malicious labels. Although these methods achieve high attack rejection rates, we find that they rely on narrow correlations in defense data rather than harmful intent, leading to systematic rejection of safe inputs. We analyze three recurring shortcut behaviors induced by defense fine-tuning. \emph{Position bias} arises when benign content placed later in a prompt is rejected at much higher rates; across reasoning benchmarks, suffix-task rejection rises from below \textbf{10\%} to as high as \textbf{90\%}. \emph{Token trigger bias} occurs when strings common in attack data raise rejection probability even in benign contexts; inserting a single trigger token increases false refusals by up to \textbf{50\%}. \emph{Topic generalization bias} reflects poor generalization beyond the defense data distribution, with defended models suffering test-time accuracy drops of up to \textbf{40\%}. These findings suggest that current prompt-injection defenses frequently respond to attack-like surface patterns rather than the underlying intent. We introduce controlled diagnostic datasets and a systematic evaluation across two base models and multiple defense pipelines, highlighting limitations of supervised fine-tuning for reliable LLM security.

🔍 Key Points

Identification of shortcut biases in defense mechanisms: The paper identifies three key biases in LLM defenses (Position Bias, Token Trigger Bias, and Topic Generalization Bias) that compromise the effectiveness of prompt injection defenses and lead to false refusals of benign inputs.
Controlled diagnostic datasets: The authors introduce systematic diagnostic datasets specifically designed to evaluate the effect of positional context, token identity, and topical familiarity on defense performance, uncovering the reliance on surface heuristics rather than the underlying intent.
Empirical analysis of fine-tuning methods: The study empirically evaluates existing defense methods (StrucQ and SecAlign) against baseline models, demonstrating substantial accuracy drops on benign tasks due to induced shortcut learning.
Call for intent-aware evaluations: The findings emphasize the need for intent-aware evaluations that focus on improving model reliability across benign use cases, in addition to simply rejecting malicious inputs.

💡 Why This Paper Matters

This paper presents significant findings about the vulnerabilities of large language model defenses against prompt injection attacks. The identification of shortcut biases brings new insights into the failure modes of current defense strategies, highlighting a crucial need for improved methodologies that analyze and enhance the reliability of these defenses in real-world applications. By showcasing the limitations of existing approaches, the research paves the way for future work in AI security that prioritizes intent over surface patterns, which is essential for fostering trust and safety in AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is of high interest to AI security researchers as it uncovers critical weaknesses in the defenses against prompt injection attacks, a major concern in deploying LLMs in sensitive applications. By providing a detailed analysis of shortcut biases in current model defenses, the research offers valuable insights that can inform the development of more robust, intent-aware defense mechanisms, which are essential for enhancing the security and trustworthiness of AI systems.

Defenses Against Prompt Attacks Learn Surface Heuristics

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper