PromptArmor: Simple yet Effective Prompt Injection Defenses

Authors: Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song

Published: 2025-07-21

arXiv ID: 2507.15219v1

Added to Library: 2025-11-11 14:02 UTC

📄 Abstract

Despite their potential, recent research has demonstrated that LLM agents are vulnerable to prompt injection attacks, where malicious prompts are injected into the agent's input, causing it to perform an attacker-specified task rather than the intended task provided by the user. In this paper, we present PromptArmor, a simple yet effective defense against prompt injection attacks. Specifically, PromptArmor prompts an off-the-shelf LLM to detect and remove potential injected prompts from the input before the agent processes it. Our results show that PromptArmor can accurately identify and remove injected prompts. For example, using GPT-4o, GPT-4.1, or o4-mini, PromptArmor achieves both a false positive rate and a false negative rate below 1% on the AgentDojo benchmark. Moreover, after removing injected prompts with PromptArmor, the attack success rate drops to below 1%. We also demonstrate PromptArmor's effectiveness against adaptive attacks and explore different strategies for prompting an LLM. We recommend that PromptArmor be adopted as a standard baseline for evaluating new defenses against prompt injection attacks.

🔍 Key Points

Introduction of DataSentinel, a game-theoretic method that enhances prompt injection attack detection using fine-tuned LLMs.
Formulation of detection as a minimax optimization problem considering both detection LLM fine-tuning and adaptive attacks.
Demonstrated effectiveness of DataSentinel through evaluations on diverse benchmark datasets and multiple LLMs, achieving near-zero false positive and negative rates.
Showcased significant performance improvements over existing baseline methods, particularly for adaptive prompt injection attacks, indicating practical application for real-world LLM integrations.

💡 Why This Paper Matters

This paper introduces a novel approach to detecting prompt injection attacks using game-theoretic principles. By fine-tuning LLMs to discern clean from contaminated data, the authors provide a robust defense mechanism that adapts to evolving attack strategies. The effectiveness of DataSentinel across various tasks highlights its potential impact on enhancing the security of LLM-integrated applications.

🎯 Why It's Interesting for AI Security Researchers

The research is crucial for AI security researchers focused on ensuring the integrity and reliability of LLM applications. As prompt injection attacks become more sophisticated, understanding and mitigating these vulnerabilities are paramount for developing secure AI systems. The paper's innovative methodology and promising results present valuable insights and tools for the AI security community.

PromptArmor: Simple yet Effective Prompt Injection Defenses

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper