Defense Against Indirect Prompt Injection via Tool Result Parsing

📄 Abstract

As LLM agents transition from digital assistants to physical controllers in autonomous systems and robotics, they face an escalating threat from indirect prompt injection. By embedding adversarial instructions into the results of tool calls, attackers can hijack the agent's decision-making process to execute unauthorized actions. This vulnerability poses a significant risk as agents gain more direct control over physical environments. Existing defense mechanisms against Indirect Prompt Injection (IPI) generally fall into two categories. The first involves training dedicated detection models; however, this approach entails high computational overhead for both training and inference, and requires frequent updates to keep pace with evolving attack vectors. Alternatively, prompt-based methods leverage the inherent capabilities of LLMs to detect or ignore malicious instructions via prompt engineering. Despite their flexibility, most current prompt-based defenses suffer from high Attack Success Rates (ASR), demonstrating limited robustness against sophisticated injection attacks. In this paper, we propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code. Our approach achieves competitive Utility under Attack (UA) while maintaining the lowest Attack Success Rate (ASR) to date, significantly outperforming existing methods. Code is available at GitHub.

🔍 Key Points

Proposes a novel defense mechanism against Indirect Prompt Injection (IPI) attacks using tool result parsing to filter out malicious content before it reaches Large Language Models (LLMs).
Introduces two main modules: ParseData for extracting necessary data from tool results while enforcing format and logical constraints, and CheckTool for monitoring and sanitizing large text content retrieved from tools.
Demonstrates that the proposed defense achieves the lowest Attack Success Rate (ASR) among existing methods while maintaining competitive Utility under Attack (UA), reflected through extensive experiments on the AgentDojo benchmark.
Highlights the importance of avoiding computational overhead associated with existing model-based defenses, favoring a prompt-based approach that can adapt as LLMs evolve.
Acknowledges limitations regarding parameter hijacking in IPI, indicating the need for further research in broader defense strategies.

💡 Why This Paper Matters

This paper provides a significant advancement in the security of Large Language Models, particularly in their application in autonomous systems. By offering a robust defense against indirect prompt injection attacks, it enhances the safety and reliability of LLMs when interacting with external tools, which is crucial for the deployment of autonomous agents in real-world applications involving critical decision-making processes.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant as it addresses a growing vulnerability in LLMs related to indirect prompt injection. The novel defense mechanism outlined in the work not only contributes to the discourse on securing AI systems but also sets a foundation for future research into more comprehensive security measures, particularly in safeguarding LLM integrations in various environments, including robotics and automated control systems.

Defense Against Indirect Prompt Injection via Tool Result Parsing

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper