
ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Authors: Che Wang, Fuyao Zhang, Jiaming Zhang, Ziqi Zhang, Yinghui Wang, Longtao Huang, Jianbo Gao, Zhong Chen, Wei Yang Bryan Lim

Published: 2026-02-24

arXiv ID: 2602.20708v1

Added to Library: 2026-02-25 03:02 UTC

Red Teaming

📄 Abstract

Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding an over-50% gain in task utility. Furthermore, ICON demonstrates robust Out-of-Distribution (OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.

🔍 Key Points

  • ICON proposes a novel probing-to-mitigation framework that effectively neutralizes Indirect Prompt Injection (IPI) attacks while maintaining the task utility of LLM agents.
  • The framework introduces the Focus Intensity Score (FIS) to identify attention anomalies, facilitating real-time detection of adversarial manipulations in the latent space.
  • The Latent Space Trace Prober (LSTP) and Mitigating Rectifier (MR) work synergistically to detect IPI attacks and rectify the agent's trajectory without compromising functionality or causing over-refusals.
  • Extensive evaluations demonstrate that ICON achieves a competitive Attack Success Rate (ASR) of 0.4% and an average utility gain of over 50% compared to other defenses, illustrating its effectiveness in maintaining operational integrity.
  • ICON shows robustness across multiple backbones and demonstrates exceptional out-of-distribution generalization, suggesting its adaptability to evolving adversarial techniques.
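The paper's exact definition of the Focus Intensity Score is not reproduced in this summary, but the underlying idea of detecting "over-focusing" can be sketched as measuring how much of the model's attention mass concentrates on the retrieved (tool-output) span rather than the user's instruction. The function name, the `(heads, seq_len)` layout, and the flagging threshold below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def focus_intensity_score(attn, tool_span):
    """Hypothetical focus-intensity metric: the fraction of the final query
    token's attention mass that lands on the retrieved-content span,
    averaged over heads. Values near 1.0 suggest the model is
    'over-focusing' on external content instead of the user instruction.

    attn      : (num_heads, seq_len) attention weights of the last query
                position; each row is assumed to sum to 1.
    tool_span : (start, end) token positions of the retrieved content.
    """
    start, end = tool_span
    mass_on_span = attn[:, start:end].sum(axis=1)  # per-head mass on the span
    return float(mass_on_span.mean())

# Toy example: 2 heads, 10 tokens, retrieved content at positions 4..9.
rng = np.random.default_rng(0)
attn = rng.random((2, 10))
attn /= attn.sum(axis=1, keepdims=True)  # normalize each head's row to 1

fis = focus_intensity_score(attn, (4, 10))
is_suspicious = fis > 0.8  # assumed threshold; a real probe would calibrate it
```

In the paper's framework, a score like this would only trigger the detection stage; the Mitigating Rectifier then steers attention back toward task-relevant tokens rather than refusing outright.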

💡 Why This Paper Matters

ICON represents a significant advancement in the defense mechanisms against indirect prompt injection attacks, addressing a pressing vulnerability in large language model agents. Its innovative approach not only enhances security but also preserves task continuity, making it critical for the real-world application of LLMs in various industries. This dual focus on security and efficiency speaks to the practical needs of deploying AI safely and effectively in dynamic environments.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it tackles new and sophisticated threats posed by indirect prompt injection attacks, which are increasingly relevant as LLM agents become more autonomous. By providing a novel framework that uniquely balances security with operational efficiency, the findings contribute essential knowledge to the evolving discourse on adversarial robustness in AI systems, helping enhance the safety and reliability of AI applications.
