← Back to Library

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Authors: Rishika Bhagwatkar, Kevin Kasa, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham, Alexandre Lacoste

Published: 2025-10-06

arXiv ID: 2510.05244v1

Added to Library: 2025-11-17 01:01 UTC

Red Teaming

📄 Abstract

AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular and model-agnostic defense operating at the agent--tool interface achieves perfect security (0% or the lowest possible attack success rate) with high utility (task success rate) across four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent and tau-Bench, while achieving a state-of-the-art security-utility tradeoff compared to prior results. Specifically, we employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior complex approaches, this firewall defense makes minimal assumptions on the agent and can be deployed out-of-the-box, while maintaining strong performance without compromising utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and most importantly, weak attacks, hindering significant progress in the field. To foster more meaningful progress, we present targeted fixes to these issues for AgentDojo and Agent Security Bench while proposing best-practices for more robust benchmark design. Further, we demonstrate that although these firewalls push the state-of-the-art on existing benchmarks, it is still possible to bypass them in practice, underscoring the need to incorporate stronger attacks in security benchmarks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach and highlights the need for stronger agentic security benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.

🔍 Key Points

  • The paper introduces a modular and model-agnostic defense against indirect prompt injection attacks using two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer), achieving near-zero attack success rates across multiple benchmarks.
  • The firewalls provide a significant improvement in security-utility tradeoff without requiring complex or proprietary methods, thereby making them easily deployable in real-world systems.
  • Critical analyses of existing benchmarks reveal flaws in attack methodologies and evaluation metrics, prompting the authors to propose targeted fixes for AgentDojo and Agent Security Bench, and to emphasize the need for stronger attack strategies in datasets.
  • The paper demonstrates that despite the effectiveness of their firewalls, it remains possible to bypass them through cleverly crafted attacks, thus motivating continuous improvement and adaptation.
  • The findings suggest that current benchmarks are insufficient for stressing the capabilities of defenses and emphasize the necessity for more comprehensive and dynamic evaluation frameworks.

💡 Why This Paper Matters

This paper presents significant advancements in the defense of AI agents against indirect prompt injection attacks through innovative firewall mechanisms. It not only enhances security but also identifies and addresses existing shortcomings in benchmarking methodologies, setting a new standard for future research in the field. The implications of this work are crucial for improving the robustness of AI systems in practice, as they are increasingly integrated into sensitive applications.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper highlights critical vulnerabilities in existing AI frameworks and offers novel defense mechanisms that significantly enhance system resilience. The comprehensive evaluation of benchmarks speaks to the evolving nature of attack methodologies, prompting a reevaluation of how AI systems are tested and secured. As the landscape of AI threats continues to evolve, these insights and advancements are essential for developing effective and sustainable AI security strategies.

📚 Read the Full Paper