Defending Against Prompt Injection with DataFilter

Authors: Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner

Published: 2025-10-22

arXiv ID: 2510.19207v1

Added to Library: 2025-11-11 14:04 UTC

Red Teaming

📄 Abstract

When large language model (LLM) agents are increasingly deployed to automate tasks and interact with untrusted external data, prompt injection emerges as a significant security threat. By injecting malicious instructions into the data that LLMs access, an attacker can arbitrarily override the original user task and redirect the agent toward unintended, potentially harmful actions. Existing defenses either require access to model weights (fine-tuning), incur substantial utility loss (detection-based), or demand non-trivial system redesign (system-level). Motivated by this, we propose DataFilter, a test-time model-agnostic defense that removes malicious instructions from the data before it reaches the backend LLM. DataFilter is trained with supervised fine-tuning on simulated injections and leverages both the user's instruction and the data to selectively strip adversarial content while preserving benign information. Across multiple benchmarks, DataFilter consistently reduces the prompt injection attack success rates to near zero while maintaining the LLMs' utility. DataFilter delivers strong security, high utility, and plug-and-play deployment, making it a strong practical defense to secure black-box commercial LLMs against prompt injection. Our DataFilter model is released at https://huggingface.co/JoyYizhu/DataFilter for immediate use, with the code to reproduce our results at https://github.com/yizhu-joy/DataFilter.

🔍 Key Points

Introduction of DataFilter, a model-agnostic defense against prompt injection attacks, which effectively removes malicious instructions from untrusted data while preserving benign information.
Empirical results demonstrate that DataFilter reduces attack success rates (ASR) to near zero without significant loss of utility, outperforming existing defenses and providing a better security-utility trade-off.
DataFilter does not require access to model weights, making it practical for deployment in black-box settings, thus appealing to organizations using commercial LLMs without retraining options.
The paper highlights the importance of generalization in defenses against various prompt injection techniques, indicating that DataFilter has been successfully trained on diverse attack types and can recognize unseen attacks.
Release of the DataFilter model and code for easy integration and immediate use by practitioners in securing LLM applications.

💡 Why This Paper Matters

The work presented in this paper addresses a significant security threat posed by prompt injection attacks on LLMs, offering a robust, easy-to-deploy solution that balances high security with utility preservation. DataFilter's innovative approach and the results demonstrate its potential to protect AI agents operating in real-world conditions, making it a critical contribution to the field of AI security.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers due to its focus on a prevalent vulnerability in LLM applications, providing insights into effective defensive strategies. The introduction of a practical, model-agnostic approach like DataFilter enhances understanding of how to mitigate exploitation risks in AI systems, promoting further exploration and development of secure AI applications.

Defending Against Prompt Injection with DataFilter

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper