
AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Authors: Tian Zhang, Yiwei Xu, Juan Wang, Keyan Guo, Xiaoyang Xu, Bowen Xiao, Quanlong Guan, Jinlin Fan, Jiawei Liu, Zhiquan Liu, Hongxin Hu

Published: 2026-02-26

arXiv ID: 2602.22724v1

Added to Library: 2026-02-27 03:00 UTC

Red Teaming

📄 Abstract

Large language model (LLM) agents increasingly rely on external tools and retrieval systems to autonomously complete complex tasks. However, this design exposes agents to indirect prompt injection (IPI), where attacker-controlled context embedded in tool outputs or retrieved content silently steers agent actions away from user intent. Unlike prompt-based attacks, IPI unfolds over multi-turn trajectories, making malicious control difficult to disentangle from legitimate task execution. Existing inference-time defenses primarily rely on heuristic detection and conservative blocking of high-risk actions, which can prematurely terminate workflows or broadly suppress tool usage under ambiguous multi-turn scenarios. We propose AgentSentry, a novel inference-time detection and mitigation framework for tool-augmented LLM agents. To the best of our knowledge, AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. It localizes takeover points via controlled counterfactual re-executions at tool-return boundaries and enables safe continuation through causally guided context purification that removes attack-induced deviations while preserving task-relevant evidence. We evaluate AgentSentry on the AgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs. AgentSentry eliminates successful attacks and maintains strong utility under attack, achieving an average Utility Under Attack (UA) of 74.55%, improving UA by 20.8 to 33.6 percentage points over the strongest baselines without degrading benign performance.
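The two headline metrics in the abstract can be computed from per-task outcomes of attacked runs roughly as below. This is an illustrative sketch only: the field names (`attack_succeeded`, `task_completed`) are hypothetical placeholders, not AgentDojo's actual result schema.

```python
from typing import List, Dict, Tuple

def attack_metrics(results: List[Dict[str, bool]]) -> Tuple[float, float]:
    """Aggregate outcomes of attacked runs into two percentages:
    - ASR (attack success rate): fraction of runs where the injected goal
      was achieved (lower is better; the paper reports 0%).
    - UA (utility under attack): fraction of runs where the user's original
      task was still completed (higher is better; the paper reports 74.55%
      on average).
    """
    n = len(results)
    asr = 100.0 * sum(r["attack_succeeded"] for r in results) / n
    ua = 100.0 * sum(r["task_completed"] for r in results) / n
    return asr, ua
```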

🔍 Key Points

  • Introduction of AgentSentry, a framework for mitigating indirect prompt injection (IPI) in tool-augmented LLM agents, modeling IPI as a temporal causal takeover process.
  • Utilization of controlled counterfactual re-executions at tool-return boundaries for diagnosing takeover points and enabling safe task continuation.
  • Implementation of causally gated context purification that suppresses attack-induced signals while preserving necessary contextual information for task completion.
  • Demonstrated effectiveness through comprehensive evaluation on the AgentDojo benchmark, achieving a 0% attack success rate while maintaining an average Utility Under Attack (UA) of 74.55%, significantly outperforming existing defenses.
  • Explicit handling of multi-turn workflows, which are critical in real-world deployments, providing a systematic defense against sophisticated attacks that exploit temporal dependencies across turns.
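The diagnose-then-purify loop described in the key points can be sketched as follows. Everything here is a hypothetical illustration of the general idea, not the paper's implementation: the `Step` record, the `replay`, `deviates`, and `strip` callbacks, and the neutral stand-in output are all assumed placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    tool_name: str
    tool_output: str   # attacker-controllable channel (tool return)
    action: str        # action the agent took after seeing that output

def diagnose_takeover(
    steps: List[Step],
    replay: Callable[[List[Step], int, str], str],
    deviates: Callable[[str, str], bool],
    neutral_output: str = "",
) -> int:
    """Locate the earliest tool-return boundary where the trajectory is
    causally taken over: re-execute the run with that tool output replaced
    by a neutral stand-in and check whether the agent's next action changes.
    Returns the index of the first such step, or -1 if none deviates."""
    for i, step in enumerate(steps):
        # Controlled counterfactual re-execution at boundary i.
        counterfactual_action = replay(steps, i, neutral_output)
        if deviates(step.action, counterfactual_action):
            return i
    return -1

def purify(steps: List[Step], takeover_idx: int,
           strip: Callable[[str], str]) -> List[Step]:
    """Causally guided purification: keep the trajectory up to the takeover
    point, strip attack-induced content from the offending tool return, and
    drop the attack-induced action so the agent can safely continue."""
    cleaned = steps[:takeover_idx]
    bad = steps[takeover_idx]
    cleaned.append(Step(bad.tool_name, strip(bad.tool_output), action=""))
    return cleaned
```

In a real system the `replay` callback would re-invoke the agent on the trajectory prefix, and `deviates` would compare actions semantically rather than by string equality; both are reduced to toys here.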

💡 Why This Paper Matters

The research presented in this paper advances the security of LLM agents. AgentSentry not only offers a novel approach to detecting and mitigating indirect prompt injection attacks but also preserves agent utility and functionality, making it a valuable contribution to the development of secure AI systems. By directly handling the complexities of multi-turn agent interactions, this work lays a methodological foundation for future security enhancements in AI applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant as it tackles a pressing challenge in the field: the vulnerability of tool-augmented LLM agents to sophisticated injection attacks. The proposed techniques, temporal causal diagnostics and context purification, represent meaningful advances in the defense mechanisms available for LLM agents. Given the increasing reliance on AI in sensitive applications, robust security measures are essential, and this paper provides a substantial framework for addressing these vulnerabilities.

📚 Read the Full Paper