Can Indirect Prompt Injection Attacks Be Detected and Removed?

Authors: Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, Bryan Hooi

Published: 2025-02-23

arXiv ID: 2502.16580v5

Added to Library: 2025-11-11 14:04 UTC

Red Teaming

📄 Abstract

Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after detection. In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. For detection, we assess the performance of existing LLMs and open-source detection models, and we further train detection models using our crafted training datasets. For removal, we evaluate two intuitive methods: (1) the segmentation removal method, which segments the injected document and removes parts containing injected instructions, and (2) the extraction removal method, which trains an extraction model to identify and remove injected instructions.

🔍 Key Points

The authors identify a research gap in the detection and removal of indirect prompt injection attacks in large language models (LLMs).
They propose two removal methods: a segmentation approach that removes parts containing injected instructions based on segment classification, and an extraction approach that identifies and removes the injected content using a trained model.
The study presents a benchmark dataset specifically developed to evaluate the effectiveness of detection and removal methods against indirect prompt injection attacks.
Experimental results show that existing models struggle with detection, while newly trained models using crafted datasets perform significantly better.
The combination of detection and removal techniques demonstrates improved defense performance compared to prior prompt-engineering and fine-tuning methods.

💡 Why This Paper Matters

This paper is relevant in the ever-evolving landscape of AI security, particularly as LLMs are increasingly integrated into applications that can be exploited through prompt injection. By addressing both the detection and mitigation of indirect prompt injection attacks, the authors provide critical insights and practical solutions for enhancing the security of LLMs against emerging threats, making their findings significant for researchers and developers in this field.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would be particularly interested in this paper as it tackles the urgent issue of prompt injection attacks that can compromise LLMs. The study's novel methodologies for detection and removal not only contribute to the theoretical understanding of adversarial threats in AI systems but also provide tangible strategies for mitigating such risks in practical applications. Given the growing reliance on LLMs across various sectors, ensuring their robustness against such attacks is crucial, making this work a vital contribution to the field.

Can Indirect Prompt Injection Attacks Be Detected and Removed?

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper