
Goal-guided Generative Prompt Injection Attack on Large Language Models

Authors: Chong Zhang, Mingyu Jin, Qinkai Yu, Chengzhi Liu, Haochen Xue, Xiaobo Jin

Published: 2024-04-06

arXiv ID: 2404.07234v4

Added to Library: 2025-11-11 14:27 UTC

Red Teaming

📄 Abstract

Current large language models (LLMs) provide a strong foundation for large-scale, user-oriented natural language tasks. However, users can easily inject adversarial text or instructions through the user interface, posing security challenges for LLMs. Although there is a large body of research on prompt injection attacks, most black-box attacks rely on heuristic strategies, and it is unclear how these heuristics relate to attack success rates, which limits their value for improving model robustness. To address this problem, we redefine the goal of the attack: to maximize the KL divergence between the conditional probabilities of the clean text and the adversarial text. Furthermore, we prove that when the conditional probabilities are Gaussian, maximizing the KL divergence is equivalent to maximizing the Mahalanobis distance between the embedded representations $x$ and $x'$ of the clean and adversarial text, and we give a quantitative relationship between $x$ and $x'$. We then design a simple and effective goal-guided generative prompt injection strategy (G2PIA) that finds injection text satisfying specific constraints to approximate the optimal attack. Notably, our attack is a query-free black-box method with low computational cost. Experimental results on seven LLMs and four datasets demonstrate the effectiveness of our attack.
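
As a minimal sketch of the Gaussian argument above (assuming both conditional distributions share a covariance $\Sigma$; the paper's exact parameterization may differ): for $p = \mathcal{N}(x, \Sigma)$ and $q = \mathcal{N}(x', \Sigma)$,

$$D_{\mathrm{KL}}(p \,\|\, q) = \tfrac{1}{2}\,(x - x')^{\top} \Sigma^{-1} (x - x') = \tfrac{1}{2}\, d_M(x, x')^2,$$

so maximizing the KL divergence over the adversarial embedding $x'$ amounts to maximizing the Mahalanobis distance $d_M(x, x') = \sqrt{(x - x')^{\top} \Sigma^{-1} (x - x')}$, which is the equivalence the abstract refers to.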

🔍 Key Points

  • Proposes a goal-guided generative prompt injection attack (G2PIA) on large language models (LLMs) that redefines the attack objective as maximizing the KL divergence between the model's conditional distributions for clean and adversarial text.
  • Establishes a theoretical framework proving that, under Gaussian assumptions, maximizing the KL divergence is equivalent to maximizing the Mahalanobis distance between the clean and adversarial embeddings, giving a quantitative relationship between the two (see the sketch after this list).
  • Demonstrates an effective, query-free black-box attack across multiple state-of-the-art LLMs, reporting substantial attack success rates at low computational cost in experiments on seven models and four datasets.
  • Highlights the limitations of existing heuristic prompt injection strategies by providing a structured methodology that ties the attack objective to a measurable quantity, enabling more principled analysis of attack success and model robustness.
  • Presents comprehensive experimental results that underline the efficacy of the proposed attack, revealing vulnerabilities in widely used LLMs and drawing out implications for real-world deployments.
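
To make the selection criterion concrete, below is a minimal, hypothetical sketch (not the paper's released implementation) of how injection candidates could be scored: the clean prompt and each candidate are embedded, candidates are kept only if they stay above a cosine-similarity floor (a stand-in for the paper's semantic constraints), and the surviving candidate with the largest Mahalanobis distance from the clean embedding is selected. The `embed` callable, the candidate list, the inverse covariance `cov_inv`, and the `min_cos` threshold are all illustrative assumptions.

```python
# Hypothetical sketch: score candidate injection texts by Mahalanobis distance
# from the clean prompt's embedding, subject to a cosine-similarity floor that
# keeps the injection semantically related to the clean prompt.
import numpy as np


def mahalanobis_sq(x: np.ndarray, x_adv: np.ndarray, cov_inv: np.ndarray) -> float:
    """Squared Mahalanobis distance between two embedding vectors."""
    d = x - x_adv
    return float(d @ cov_inv @ d)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def select_injection(clean_text: str,
                     candidates: list[str],
                     embed,                 # callable: str -> np.ndarray (assumed)
                     cov_inv: np.ndarray,   # inverse covariance of embeddings (assumed)
                     min_cos: float = 0.7) -> str:
    """Pick the candidate maximizing Mahalanobis distance from the clean
    embedding while staying above the cosine-similarity floor."""
    x = embed(clean_text)
    best, best_score = None, -np.inf
    for cand in candidates:
        x_adv = embed(cand)
        if cosine(x, x_adv) < min_cos:  # discard off-topic candidates
            continue
        score = mahalanobis_sq(x, x_adv, cov_inv)
        if score > best_score:
            best, best_score = cand, score
    return best if best is not None else clean_text
```

Note that scoring of this kind only touches an embedding model, never the target LLM, which is consistent with the query-free black-box setting described above.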

💡 Why This Paper Matters

This paper provides crucial insights into the security vulnerabilities of large language models, detailing a systematic approach to adversarial attacks that can severely compromise their performance. By refining attack methodologies and establishing a solid theoretical foundation, the work opens pathways for improving the robustness of these models against adversarial threats. The findings serve as a wake-up call for developers and researchers to address potential exploits in AI systems before they are deployed in sensitive environments.

🎯 Why It's Interesting for AI Security Researchers

The research is particularly relevant for AI security researchers as it directly tackles the emerging threats posed by adversarial attacks on large language models, a growing concern in the field of AI safety. By presenting a novel attack methodology and revealing concrete vulnerabilities in popular models, the paper underscores the need for enhanced security measures and inspires further investigation into robust defense mechanisms.
