โ† Back to Library

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

Authors: Mintong Kang, Chong Xiang, Sanjay Kariyappa, Chaowei Xiao, Bo Li, Edward Suh

Published: 2025-11-30

arXiv ID: 2512.00966v1

Added to Library: 2025-12-02 03:00 UTC

Red Teaming

📄 Abstract

Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight of IntentGuard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flag or neutralize any overlaps with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs. These techniques include start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario).
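
Concretely, the decision rule described in the abstract reduces to a span-overlap check: if any instruction the model intends to follow lies inside a segment marked as untrusted data, the agent flags or neutralizes it before acting. The Python sketch below is a minimal, hypothetical illustration of that rule; the `Span` and `detect_injection` names and the character-span representation are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the overlap check behind IntentGuard's decision rule.
# The IIA is assumed to return character spans of instructions the model
# intends to follow; untrusted spans mark tool outputs / retrieved data.
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the prompt
    end: int    # exclusive character offset into the prompt


def overlaps(a: Span, b: Span) -> bool:
    """True if the two character spans intersect."""
    return a.start < b.end and b.start < a.end


def detect_injection(intended: list[Span], untrusted: list[Span]) -> list[Span]:
    """Return intended-instruction spans that fall inside untrusted data.

    Any non-empty result means the model is about to act on an instruction
    originating from untrusted content, so the agent call can be flagged or
    the offending instruction neutralized before execution.
    """
    return [i for i in intended if any(overlaps(i, u) for u in untrusted)]


# Example: the IIA reports two intended instructions; the second sits inside
# a region marked untrusted (e.g., a tool response), so it gets flagged.
intended = [Span(0, 42), Span(120, 180)]
untrusted = [Span(100, 400)]
assert detect_injection(intended, untrusted) == [Span(120, 180)]
```

In the paper's framing, a check of this kind is what turns the IIA's output into a concrete flag-or-neutralize action before the agent executes its next step.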

๐Ÿ” Key Points

  • Introduction of IntentGuard, a framework for mitigating indirect prompt injection attacks (IPIAs) by leveraging instruction-following intent analysis.
  • The development of an instruction-following intent analyzer (IIA) that uses three "thinking intervention" strategies (start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration) to elicit a structured list of intended instructions from reasoning-enabled large language models (LLMs); a minimal sketch of the prefilling idea follows this list.
  • Evaluation on two agentic benchmarks (AgentDojo and Mind2Web) with Qwen-3-32B and gpt-oss-20B, showing that IntentGuard preserves utility in all but one setting and sharply reduces attack success rates under adaptive prompt injection (e.g., from 100% to 8.5% in a Mind2Web scenario).
  • A general framework that can host different IIA instantiations, allowing adaptation to other LLM architectures and integration with complementary defense strategies.
  • A combination of structured reasoning interventions that strengthens the analyzer's robustness while leaving the model's handling of the trusted task intact.
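
As referenced in the list above, the "thinking intervention" strategies steer a reasoning-enabled model into exposing its intended instructions in a parseable form. The sketch below illustrates only the start-of-thinking prefilling idea; the prefill wording, message layout, and function names are assumptions for illustration, not the paper's prompts or code.

```python
# Hedged sketch of "start-of-thinking prefilling": seed the beginning of a
# reasoning-enabled model's thinking so its first output is a structured
# list of the instructions it currently intends to follow. The prefill text
# and message layout below are illustrative, not the paper's verbatim prompts.

THINKING_PREFILL = (
    "Before acting, I will enumerate every instruction I intend to follow, "
    "one per line, each tagged with where it came from "
    "(system prompt, user request, or retrieved/tool data):\n1."
)


def build_prefilled_messages(system_prompt: str, user_request: str, tool_data: str) -> list[dict]:
    """Assemble a chat request whose assistant turn is prefilled so the
    model's reasoning opens with an explicit inventory of intended instructions."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{user_request}\n\n[UNTRUSTED DATA]\n{tool_data}"},
        # Prefilled opening of the assistant's thinking; the model continues
        # from here, and the IIA parses the continuation into the list of
        # intended instructions checked against untrusted segments.
        {"role": "assistant", "content": THINKING_PREFILL},
    ]
```

Instructions in the elicited list that trace back to the [UNTRUSTED DATA] block become candidates for flagging or removal, mirroring the span-overlap check sketched under the abstract.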

💡 Why This Paper Matters

This paper presents a novel approach to a significant security vulnerability in LLM-powered systems: indirect prompt injection attacks. By focusing on whether the model intends to follow instructions drawn from untrusted data, rather than on detecting malicious text itself, and by building a mechanism to surface and neutralize that intent, IntentGuard makes a substantial contribution to AI security. The results both demonstrate the effectiveness of the proposed defense and open avenues for future work on protecting AI systems from manipulation and unauthorized directives.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it tackles a pressing issue in the deployment of large language models: their susceptibility to prompt injection attacks. As LLMs are integrated into more applications and agents, understanding and mitigating these threats becomes crucial. Researchers can draw on the instruction-following intent analysis methodology and the IntentGuard framework, which offer a new paradigm for addressing similar vulnerabilities across AI systems.

📚 Read the Full Paper