PromptShield: Deployable Detection for Prompt Injection Attacks

Authors: Dennis Jacob, Hend Alzahrani, Zhanhao Hu, Basel Alomair, David Wagner

Published: 2025-01-25

arXiv ID: 2501.15145v2

Added to Library: 2025-11-11 14:02 UTC

📄 Abstract

Application designers have moved to integrate large language models (LLMs) into their products. However, many LLM-integrated applications are vulnerable to prompt injections. While attempts have been made to address this problem by building prompt injection detectors, many are not yet suitable for practical deployment. To support research in this area, we introduce PromptShield, a benchmark for training and evaluating deployable prompt injection detectors. Our benchmark is carefully curated and includes both conversational and application-structured data. In addition, we use insights from our curation process to fine-tune a new prompt injection detector that achieves significantly higher performance in the low false positive rate (FPR) evaluation regime compared to prior schemes. Our work suggests that careful curation of training data and larger models can contribute to strong detector performance.

🔍 Key Points

Introduction of DataSentinel, a game-theoretic method that enhances prompt injection attack detection using fine-tuned LLMs.
Formulation of detection as a minimax optimization problem considering both detection LLM fine-tuning and adaptive attacks.
Demonstrated effectiveness of DataSentinel through evaluations on diverse benchmark datasets and multiple LLMs, achieving near-zero false positive and negative rates.
Showcased significant performance improvements over existing baseline methods, particularly for adaptive prompt injection attacks, indicating practical application for real-world LLM integrations.

💡 Why This Paper Matters

This paper introduces a novel approach to detecting prompt injection attacks using game-theoretic principles. By fine-tuning LLMs to discern clean from contaminated data, the authors provide a robust defense mechanism that adapts to evolving attack strategies. The effectiveness of DataSentinel across various tasks highlights its potential impact on enhancing the security of LLM-integrated applications.

🎯 Why It's Interesting for AI Security Researchers

The research is crucial for AI security researchers focused on ensuring the integrity and reliability of LLM applications. As prompt injection attacks become more sophisticated, understanding and mitigating these vulnerabilities are paramount for developing secure AI systems. The paper's innovative methodology and promising results present valuable insights and tools for the AI security community.

PromptShield: Deployable Detection for Prompt Injection Attacks

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper