
Defending Against Prompt Injection With a Few DefensiveTokens

Authors: Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner

Published: 2025-07-10

arXiv ID: 2507.07974v2

Added to Library: 2025-11-11 14:11 UTC

Red Teaming

πŸ“„ Abstract

When large language model (LLM) systems interact with external data to perform complex tasks, a new attack, namely prompt injection, becomes a significant threat. By injecting instructions into the data accessed by the system, the attacker is able to override the initial user task with an arbitrary task directed by the attacker. To secure the system, test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner. However, they are much less effective than training-time defenses that change the model parameters. Motivated by this, we propose DefensiveToken, a test-time defense with prompt injection robustness comparable to training-time alternatives. DefensiveTokens are newly inserted as special tokens, whose embeddings are optimized for security. In security-sensitive cases, system developers can prepend a few DefensiveTokens to the LLM input to achieve security with a minimal utility drop. In scenarios where security is less of a concern, developers can simply skip DefensiveTokens; the LLM system remains the same as if there were no defense, generating high-quality responses. Thus, DefensiveTokens, if released alongside the model, allow a flexible switch between the state-of-the-art (SOTA) utility and almost-SOTA security at test time. The code is available at https://github.com/Sizhe-Chen/DefensiveToken.
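
The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the test-time plumbing it implies: a few pre-optimized embeddings are prepended to the input embeddings, and the defense can be skipped entirely when security is not needed. It assumes a Hugging Face causal LM; the model name is a placeholder, and the randomly initialized defensive_embeds stand in for the released, security-optimized embeddings.

```python
# Minimal sketch (not the authors' released code) of test-time use of
# DefensiveTokens: prepend a few pre-optimized embeddings to the input.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# A handful of defensive token embeddings, shape (num_tokens, hidden_size).
# In the paper these are optimized for security; random vectors here only
# illustrate the plumbing.
num_defensive_tokens = 5
hidden_size = model.get_input_embeddings().embedding_dim
defensive_embeds = torch.randn(num_defensive_tokens, hidden_size, dtype=model.dtype)

def generate_with_defensive_tokens(prompt: str, use_defense: bool = True) -> str:
    """Generate a response, optionally prepending the defensive embeddings."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    inputs_embeds = model.get_input_embeddings()(input_ids)      # (1, seq, hidden)
    if use_defense:
        prefix = defensive_embeds.unsqueeze(0)                   # (1, k, hidden)
        inputs_embeds = torch.cat([prefix, inputs_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    output_ids = model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        max_new_tokens=128,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The use_defense flag mirrors the switch described in the abstract: with it off, the call is identical to querying the undefended model.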

πŸ” Key Points

  • Introduction of DefensiveToken as a test-time defense mechanism against prompt injection attacks in large language models (LLMs), achieving security results comparable to training-time defenses.
  • The design of DefensiveTokens leverages optimized embeddings that can be toggled on or off, allowing developers to balance security and utility as needed in different contexts (a prompt-tuning-style optimization sketch follows this list).
  • Extensive experiments demonstrate that DefensiveTokens significantly reduce attack success rates across multiple prompt injection benchmarks while minimizing degradation in the quality of model responses.
  • Ablation studies reveal key insights into the optimal configuration of DefensiveTokens, including their initialization, number, and position within the input sequence.
  • This paper highlights the importance of developer flexibility in security implementations, achieved without modifying the model's weights, paving the way for safer LLM deployment.
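
To make the "optimized embeddings" idea in the key points concrete, here is a hedged, prompt-tuning-style training sketch: the LLM stays frozen and only the few defensive embeddings receive gradients. It continues from the inference sketch above (reusing model and defensive_embeds); train_loader is a hypothetical dataloader of injected prompts paired with the desired safe responses, and the generic next-token loss is an assumption, not the paper's exact objective.

```python
# Sketch of a prompt-tuning-style optimization of the defensive embeddings
# (an assumption about the general recipe, not the paper's exact objective).
# Reuses `model` and `defensive_embeds` from the inference sketch above.
import torch

for param in model.parameters():
    param.requires_grad_(False)                       # the LLM stays frozen

defensive_embeds = torch.nn.Parameter(defensive_embeds.clone().float())
optimizer = torch.optim.Adam([defensive_embeds], lr=1e-3)

def security_loss(batch):
    """Next-token loss on the desired safe response for an injected prompt."""
    input_ids = batch["input_ids"]                    # (B, seq): prompt + target
    labels = batch["labels"]                          # (B, seq): -100 on prompt part
    embeds = model.get_input_embeddings()(input_ids)
    prefix = defensive_embeds.to(embeds.dtype).unsqueeze(0).expand(embeds.size(0), -1, -1)
    embeds = torch.cat([prefix, embeds], dim=1)
    # Ignore the prepended defensive positions in the loss.
    pad = torch.full((labels.size(0), prefix.size(1)), -100, dtype=labels.dtype)
    labels = torch.cat([pad, labels], dim=1)
    return model(inputs_embeds=embeds, labels=labels).loss

# `train_loader` is hypothetical: batches of prompts containing injections,
# labeled with the response the model *should* give (following the user task).
for batch in train_loader:
    loss = security_loss(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```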

πŸ’‘ Why This Paper Matters

This paper presents a significant advancement in the field of AI security, particularly focusing on the robustness of large language models against prompt injectionβ€”a critical vulnerability. The introduction of DefensiveTokens not only enhances the security of LLMs but also emphasizes the need for adaptable defenses that can maintain high utility without compromising functionality. As LLMs continue to integrate into various applications, this research provides a framework that developers can employ to protect systems intelligently and flexibly.

🎯 Why It's Interesting for AI Security Researchers

This paper will be of great interest to AI security researchers because it tackles a pressing issue in the deployment of large language models: the vulnerability to prompt injection attacks. By proposing a novel defensive mechanism that balances security and usability, it addresses a core concern in the design and implementation of AI systems. The findings and methodologies outlined provide a basis for further exploration and improvement of security mechanisms, highlighting potential avenues for future research in AI safety.

πŸ“š Read the Full Paper