CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Authors: Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy Dj Dvijotham, Long T. Le, Tomas Pfister

Published: 2026-02-08

arXiv ID: 2602.07918v1

Added to Library: 2026-02-10 05:01 UTC

Red Teaming

📄 Abstract

AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on "poisoned" reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.
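The leave-one-out attribution and dominance-shift trigger described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score_action` is a hypothetical stand-in for the agent's likelihood of taking the privileged action given a set of context segments, and the segment names are invented for the example.

```python
# Hedged sketch of leave-one-out (LOO) ablation attribution for IPI detection.
# Each segment's attribution = drop in the action score when that segment is
# removed from the context. A sanitization trigger fires only when an
# untrusted segment's attribution dominates the trusted user request's.

def loo_attributions(score_action, segments, action):
    """Attribution of each context segment via leave-one-out ablation."""
    full = score_action(segments, action)
    attributions = {}
    for name in segments:
        ablated = {k: v for k, v in segments.items() if k != name}
        attributions[name] = full - score_action(ablated, action)
    return attributions

def dominance_shift(attributions, trusted="user_request", margin=0.0):
    """Return the dominant untrusted segment if it out-attributes the
    trusted user request by more than `margin`, else None."""
    trusted_attr = attributions.get(trusted, 0.0)
    untrusted = {k: v for k, v in attributions.items() if k != trusted}
    top_seg, top_attr = max(untrusted.items(), key=lambda kv: kv[1])
    if top_attr - trusted_attr > margin:
        return top_seg  # dominance shift detected: sanitize only this segment
    return None
```

Because the check runs only at privileged decision points and sanitization fires only on a detected dominance shift, the benign-path cost stays close to running no defense at all, which is the selectivity the paper emphasizes.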

🔍 Key Points

  • CausalArmor detects Indirect Prompt Injection attacks through causal attribution, using a dominance-shift signature to identify when an untrusted segment influences an agent's decision more than the user's request does.
  • Sanitization is triggered selectively, only when a dominance shift is detected; unlike the always-on sanitization prevalent in existing defenses, this avoids latency and utility loss during benign operation.
  • CausalArmor employs retroactive Chain-of-Thought (CoT) masking to eliminate "poisoned" reasoning from previous agent outputs, preventing residual influence from injected content the agent has already processed.
  • Empirical evaluations on the AgentDojo and DoomArena benchmarks show CausalArmor achieves near-zero attack success rates while keeping benign utility and latency comparable to the no-defense baseline.
  • The framework comes with a theoretical guarantee: sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting a malicious action.
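The retroactive CoT masking in the third key point can be sketched in a few lines. This is an illustrative assumption about the mechanism, not the paper's code: the message schema, role names, and mask string are all hypothetical, and real detection of poisoned traces would be more nuanced than substring matching.

```python
# Hedged sketch of retroactive Chain-of-Thought masking: once an untrusted
# segment is flagged as dominant, earlier reasoning steps derived from it are
# masked so the agent cannot act on "poisoned" traces already in its history.

MASK = "[masked: reasoning derived from untrusted content]"

def mask_poisoned_cot(history, poisoned_text):
    """Return a copy of the message history with reasoning steps that
    quote the poisoned segment replaced by a mask placeholder."""
    cleaned = []
    for msg in history:
        if msg["role"] == "reasoning" and poisoned_text in msg["content"]:
            cleaned.append({"role": "reasoning", "content": MASK})
        else:
            cleaned.append(msg)
    return cleaned
```

The point of masking retroactively, rather than only sanitizing future inputs, is that an injection the agent has already reasoned about can keep steering its actions through its own earlier outputs.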

💡 Why This Paper Matters

The paper presents CausalArmor, a significant advance in defending AI tool-calling agents against Indirect Prompt Injection attacks. By combining lightweight causal attribution with selective sanitization, it maintains operational efficiency while strengthening security. This approach mitigates the over-defense dilemma faced by existing systems, improving both the explainability and the real-world applicability of AI agents operating on untrusted content.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant for AI security researchers because it tackles a pressing vulnerability in AI systems that are increasingly deployed in real-world applications. The proposed methods push the boundaries of current prompt-injection defenses while contributing theoretical insights and empirical results that can inform future research. The framework also offers a practical path to safer AI agents, adding to the ongoing discourse on securing AI deployments in sensitive domains.
