← Back to Library

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods

Authors: Yulin Chen, Haoran Li, Yuan Sui, Yangqiu Song, Bryan Hooi

Published: 2025-10-04

arXiv ID: 2510.03705v1

Added to Library: 2025-11-11 14:01 UTC

Red Teaming

📄 Abstract

With the development of technology, large language models (LLMs) have dominated the downstream natural language processing (NLP) tasks. However, because of the LLMs' instruction-following abilities and inability to distinguish the instructions in the data content, such as web pages from search engines, the LLMs are vulnerable to prompt injection attacks. These attacks trick the LLMs into deviating from the original input instruction and executing the attackers' target instruction. Recently, various instruction hierarchy defense strategies are proposed to effectively defend against prompt injection attacks via fine-tuning. In this paper, we explore more vicious attacks that nullify the prompt injection defense methods, even the instruction hierarchy: backdoor-powered prompt injection attacks, where the attackers utilize the backdoor attack for prompt injection attack purposes. Specifically, the attackers poison the supervised fine-tuning samples and insert the backdoor into the model. Once the trigger is activated, the backdoored model executes the injected instruction surrounded by the trigger. We construct a benchmark for comprehensive evaluation. Our experiments demonstrate that backdoor-powered prompt injection attacks are more harmful than previous prompt injection attacks, nullifying existing prompt injection defense methods, even the instruction hierarchy techniques.

🔍 Key Points

  • Introduction of backdoor-powered prompt injection attacks that exploit vulnerabilities in large language models (LLMs), demonstrating a novel attack vector that combines backdoor and prompt injection techniques.
  • Demonstration that existing instruction hierarchy defense methods are ineffective against these new types of attacks, rendering previous defensive strategies obsolete.
  • Development of a comprehensive benchmark consisting of various tasks (phishing, advertisement, etc.) to evaluate the effectiveness of attacks and defenses, providing a robust framework for future research.
  • Detailed experimental results showing high attack success rates (ASR) for backdoor-powered prompt injection attacks, even against models defended by state-of-the-art techniques.
  • Insights into the influence of backdoor poison rate on attack effectiveness and the minimal impact of backdooring on model utility.

💡 Why This Paper Matters

This paper is significant as it highlights a critical and emerging threat in AI security, specifically in the context of large language models. By revealing the limitations of existing defense mechanisms, it calls for the community to re-evaluate strategies to ensure the integrity and security of LLM applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant due to its exploration of novel attack methodologies that can undermine widely adopted defense mechanisms. The insights into how backdoor attacks can exploit instruction-following capabilities in LLMs provoke further investigation into robust security measures for AI systems, making it a pivotal work in the field of AI safety.

📚 Read the Full Paper