
DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

Authors: Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong

Published: 2025-04-15

arXiv ID: 2504.11358v3

Added to Library: 2025-11-11 14:02 UTC

Red Teaming

📄 Abstract

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.
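
The abstract describes the detection objective only in words. The display below is a minimal sketch of how such a minimax objective could be written; all notation (the detection LLM f_θ, the clean input x, the injected prompt z, the injection operator ⊕, and the loss ℓ) is assumed here for illustration and is not the paper's own formulation. The outer minimization fine-tunes the detection LLM, while the inner maximization searches for an injected prompt adapted to evade detection.

```latex
% Illustrative minimax objective (notation assumed, not taken from the paper):
%   f_theta  -- detection LLM with fine-tuned parameters theta
%   x        -- clean input drawn from a clean-data distribution D
%   z        -- injected prompt chosen by the adaptive attacker
%   x \oplus z -- contaminated input: clean data with the prompt z injected
%   \ell     -- detection loss (e.g., cross-entropy on the clean/contaminated label)
\min_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}}
  \Big[
    \ell\big(f_{\theta}(x), \texttt{clean}\big)
    \;+\;
    \max_{z} \, \ell\big(f_{\theta}(x \oplus z), \texttt{contaminated}\big)
  \Big]
```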

🔍 Key Points

  • Introduction of DataSentinel, a game-theoretic method that enhances prompt injection attack detection using fine-tuned LLMs.
  • Formulation of detection as a minimax optimization problem that jointly captures fine-tuning of the detection LLM (the outer minimization) and adaptive attacks crafted to evade detection (the inner maximization); a minimal alternating-optimization sketch follows this list.
  • Demonstrated effectiveness of DataSentinel through evaluations on diverse benchmark datasets and multiple LLMs, achieving near-zero false positive and false negative rates.
  • Showed significant improvements over existing baseline methods, particularly against adaptive prompt injection attacks, indicating practical applicability to real-world LLM-integrated applications.
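
The abstract states that the minimax problem is solved with a gradient-based method that alternates between the inner max and the outer min. Below is a minimal PyTorch-style sketch of what such an alternating loop could look like; the detector interface, the inject helper, the soft-prompt parameterization of the attacker, and every hyperparameter are assumptions made for illustration, not the authors' implementation.

```python
import torch

def inject(x_embeds, z_embeds):
    # Hypothetical injection: append the attacker's soft-prompt embeddings to
    # each clean input's embeddings (shapes: x_embeds (B, T, D), z_embeds (Tz, D)).
    batch = x_embeds.size(0)
    return torch.cat([x_embeds, z_embeds.expand(batch, -1, -1)], dim=1)

def alternating_minmax_finetune(detector, clean_batches, init_prompt,
                                epochs=3, inner_steps=5,
                                outer_lr=1e-5, inner_lr=1e-2):
    """Illustrative alternating min-max loop (not the authors' code).

    detector      -- detection model returning one contamination logit per input
    clean_batches -- iterable of clean input embedding batches
    init_prompt   -- tensor standing in for the attacker's injected (soft) prompt
    """
    opt = torch.optim.AdamW(detector.parameters(), lr=outer_lr)
    bce = torch.nn.BCEWithLogitsLoss()

    for _ in range(epochs):
        for x_clean in clean_batches:
            # Inner max: adapt the injected prompt so contaminated inputs
            # are (wrongly) scored as clean, i.e. evade detection.
            z = init_prompt.detach().clone().requires_grad_(True)
            for _ in range(inner_steps):
                logits = detector(inject(x_clean, z))
                evade_loss = bce(logits, torch.zeros_like(logits))  # attacker target: "clean"
                (grad,) = torch.autograd.grad(evade_loss, z)
                z = (z - inner_lr * grad).detach().requires_grad_(True)

            # Outer min: fine-tune the detector to label clean inputs as clean
            # and inputs carrying the adapted prompt as contaminated.
            opt.zero_grad()
            clean_logits = detector(x_clean)
            contam_logits = detector(inject(x_clean, z.detach()))
            loss = (bce(clean_logits, torch.zeros_like(clean_logits)) +
                    bce(contam_logits, torch.ones_like(contam_logits)))
            loss.backward()
            opt.step()
    return detector
```

The sketch treats the injected prompt as a continuous (soft) tensor so the inner maximization can be done by plain gradient ascent; an attacker constrained to discrete tokens would need a different inner-step optimizer.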

💡 Why This Paper Matters

This paper introduces a novel, game-theoretic approach to detecting prompt injection attacks. By fine-tuning an LLM to distinguish clean inputs from those contaminated with injected prompts, the authors provide a defense that remains robust as attack strategies evolve. The effectiveness of DataSentinel across varied tasks highlights its potential to strengthen the security of LLM-integrated applications.

🎯 Why It's Interesting for AI Security Researchers

This research is valuable for AI security researchers focused on ensuring the integrity and reliability of LLM-integrated applications. As prompt injection attacks grow more sophisticated, understanding and mitigating these vulnerabilities is paramount for building secure AI systems. The paper's methodology and strong empirical results offer useful insights and tools for the AI security community.

📚 Read the Full Paper