← Back to Library

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

Authors: Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I-Hsin Chung, Winston H. Hsu, Pin-Yu Chen

Published: 2024-11-01

arXiv ID: 2411.00348v2

Added to Library: 2025-11-11 14:06 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated action. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.

🔍 Key Points

  • Introduction of the 'distraction effect' concept, revealing how specific attention heads in LLMs are manipulated during prompt injection attacks.
  • Development of Attention Tracker ( extbackslash attn), a training-free method for detecting prompt injection attacks by analyzing attention patterns, achieving significant AUROC improvements compared to existing methods.
  • Demonstration of the robustness and generalizability of the distraction effect across different LLMs, attack types, and datasets, making the detection method applicable in diverse contexts.
  • Validation of the effectiveness of extbackslash attn on both large and small LMs, addressing limitations of previous detection methods that required larger models for accuracy.
  • Extensive experimental evaluations confirming the higher detection accuracy, with improvements ranging from 3.1% to 10.0% in AUROC scores over current baselines.

💡 Why This Paper Matters

This paper is relevant due to its novel approach to detecting prompt injection attacks in LLMs, a critical vulnerability in AI systems. By uncovering the distraction effect and utilizing it for real-time attack detection without needing additional training or costly inference, it presents a significant advancement in securing AI applications. The findings contribute to the ongoing conversation on AI safety and robustness, especially as LLMs become more integrated into sensitive and critical applications.

🎯 Why It's Interesting for AI Security Researchers

The paper is of great interest to AI security researchers as it tackles a pressing and emerging threat within the realm of large language models. Researchers can benefit from understanding the dynamics of attention mechanisms under attack conditions, which is essential for developing more resilient AI systems. Furthermore, the practical implications of the Attention Tracker in real-world applications enhance its appeal, providing a framework that can be adapted and scaled to safeguard various LLM-integrated systems.

📚 Read the Full Paper