System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection

📄 Abstract

Large language models (LLMs) have gained widespread adoption across diverse applications due to their impressive generative capabilities. Their plug-and-play nature enables both developers and end users to interact with these models through simple prompts. However, as LLMs become more integrated into various systems in diverse domains, concerns around their security are growing. Existing studies mainly focus on threats arising from user prompts (e.g. prompt injection attack) and model output (e.g. model inversion attack), while the security of system prompts remains largely overlooked. This work bridges the critical gap. We introduce system prompt poisoning, a new attack vector against LLMs that, unlike traditional user prompt injection, poisons system prompts hence persistently impacts all subsequent user interactions and model responses. We systematically investigate four practical attack strategies in various poisoning scenarios. Through demonstration on both generative and reasoning LLMs, we show that system prompt poisoning is highly feasible without requiring jailbreak techniques, and effective across a wide range of tasks, including those in mathematics, coding, logical reasoning, and natural language processing. Importantly, our findings reveal that the attack remains effective even when user prompts employ advanced prompting techniques like chain-of-thought (CoT). We also show that such techniques, including CoT and retrieval-augmentation-generation (RAG), which are proven to be effective for improving LLM performance in a wide range of tasks, are significantly weakened in their effectiveness by system prompt poisoning.

🔍 Key Points

Introduction of system prompt poisoning (SPP) as a novel attack vector against large language models (LLMs) that compromises the integrity of model outputs permanently through malicious alteration of system prompts.
Evaluation of three practical attack strategies (brute-force poisoning, adaptive in-context poisoning, adaptive chain-of-thought poisoning) demonstrating consistently severe effects across multiple reasoning and coding tasks.
Development of Auto-SPP, an automated framework for generating poisoned system prompts, showcasing the efficiency of the attack with low costs and fast execution times.
Empirical findings indicating that system prompt poisoning significantly diminishes the effectiveness of user prompts and advanced prompting techniques like chain-of-thought prompting, further emphasizing the seriousness of the vulnerability.
Discussion of defense mechanisms, including the need for integrity monitoring and conflict detection in system prompts, to mitigate the risks posed by system prompt poisoning.

💡 Why This Paper Matters

This paper is crucial in identifying a critical security vulnerability in large language models, focusing on system prompt poisoning and demonstrating its potentially devastating implications for LLM applications. The systematic evaluation of attack strategies and their effectiveness across various tasks underscores the urgent need for improved security measures, thus contributing meaningfully to the field of AI security.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant as it uncovers a significant and previously overlooked attack vector (system prompt poisoning) that can bypass existing defenses, posing a substantial risk to the integrity of AI applications. The paper not only details successful attack methodologies but also emphasizes the failings of current defenses, prompting further investigation and development of robust security protocols in AI systems.

System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper