← Back to Library

Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment

Authors: Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong

Published: 2024-10-18

arXiv ID: 2410.14827v3

Added to Library: 2025-11-11 14:05 UTC

Red Teaming

📄 Abstract

Prompt injection attack, where an attacker injects a prompt into the original one, aiming to make an Large Language Model (LLM) follow the injected prompt to perform an attacker-chosen task, represent a critical security threat. Existing attacks primarily focus on crafting these injections at inference time, treating the LLM itself as a static target. Our experiments show that these attacks achieve some success, but there is still significant room for improvement. In this work, we introduces a more foundational attack vector: poisoning the LLM's alignment process to amplify the success of future prompt injection attacks. Specifically, we propose PoisonedAlign, a method that strategically creates poisoned alignment samples to poison an LLM's alignment dataset. Our experiments across five LLMs and two alignment datasets show that when even a small fraction of the alignment data is poisoned, the resulting model becomes substantially more vulnerable to a wide range of prompt injection attacks. Crucially, this vulnerability is instilled while the LLM's performance on standard capability benchmarks remains largely unchanged, making the manipulation difficult to detect through automated, general-purpose performance evaluations. The code for implementing the attack is available at https://github.com/Sadcardation/PoisonedAlign.

🔍 Key Points

  • The introduction of PoisonedAlign, a novel method that poisons the alignment process of LLMs to enhance their susceptibility to prompt injection attacks.
  • Demonstrated that even a small percentage of poisoned data (10%) in alignment datasets can significantly increase vulnerability to prompt injection while maintaining standard model performance on capability benchmarks.
  • Showed robustness across five different LLMs and two alignment datasets with various task combinations indicating versatility in the attack method.
  • Conducted extensive evaluation using metrics like Attack Success Value (ASV) to quantify the effectiveness of the method and its stealthiness against detection measures.
  • Provided insights into existing defenses, revealing that currently proposed solutions like BEAT and BAIT struggle against the new threat introduced by PoisonedAlign.

💡 Why This Paper Matters

This paper is relevant and important as it addresses a critical security vulnerability in large language models that could have serious implications for AI applications dealing with untrusted or adversarial inputs. By demonstrating how poisoning the alignment process can lead to amplified risks of prompt injection attacks, the research highlights the need for more robust defense mechanisms and raises awareness about the potential misuse of LLMs through subtle data poisons.

🎯 Why It's Interesting for AI Security Researchers

The insights from this research are crucial for AI security researchers as it delves into a promising attack vector that operates at the alignment stage rather than just inference. Understanding and mitigating such risks is essential for developing secure AI systems, particularly as LLMs are increasingly integrated into applications that handle sensitive or critical data. The findings call for renewed scrutiny of training data sources and enhanced methods for ensuring robustness against sophisticated attack strategies.

📚 Read the Full Paper