AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

Authors: Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong

Published: 2026-02-14

arXiv ID: 2602.13597v1

Added to Library: 2026-02-17 03:01 UTC

Red Teaming

📄 Abstract

Prompt injection attacks insert malicious instructions into an LLM's input to steer it toward an attacker-chosen task instead of the intended one. Existing detection defenses typically classify any input containing an instruction as malicious, leading to misclassification of benign inputs whose instructions align with the intended task. In this work, we account for the instruction hierarchy and distinguish among three categories: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs. We introduce AlignSentinel, a three-class classifier that leverages features derived from an LLM's attention maps to categorize inputs accordingly. To support evaluation, we construct the first systematic benchmark containing inputs from all three categories. Experiments on both our benchmark and existing ones, where inputs with aligned instructions are largely absent, show that AlignSentinel accurately detects inputs with misaligned instructions and substantially outperforms baselines.

🔍 Key Points

  • Proposes AlignSentinel, a three-class classifier for detecting prompt injection attacks in LLMs, distinguishing between misaligned, aligned, and non-instruction inputs.
  • Utilizes features from LLM attention maps to enhance the detection process, significantly outperforming traditional binary classification approaches.
  • Constructs a novel benchmark for prompt injection detection that incorporates the instruction hierarchy, allowing for systematic evaluation of detection methods.
  • Demonstrates superior generalizability across various LLMs and application domains in both direct and indirect prompt injection scenarios.
  • Conducts extensive experiments showing reduced false positive and false negative rates compared to existing methods.
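The core pipeline implied by the key points above can be sketched in two stages: pool an LLM's attention maps into a fixed-length feature vector, then train a three-class classifier over the categories {misaligned, aligned, non-instruction}. The sketch below is illustrative only; the specific pooled statistics (per-head attention entropy and peakedness) and the simple softmax classifier are assumptions for illustration, not the features or model the paper actually uses.

```python
import numpy as np

CLASSES = ["misaligned", "aligned", "non_instruction"]

def pool_attention_features(attn_maps):
    """Pool attention maps of shape (layers, heads, seq, seq), where each
    row sums to 1, into a flat feature vector. Illustrative statistics:
    mean per-row attention entropy and mean per-row peak attention,
    computed per (layer, head) pair and concatenated.
    """
    eps = 1e-12
    # entropy of each attention row, averaged over query positions -> (L, H)
    ent = -(attn_maps * np.log(attn_maps + eps)).sum(-1).mean(-1)
    # peak attention weight per row, averaged over query positions -> (L, H)
    peak = attn_maps.max(-1).mean(-1)
    return np.concatenate([ent.ravel(), peak.ravel()])

class ThreeClassSoftmax:
    """Minimal softmax classifier trained with batch gradient descent."""

    def __init__(self, dim, n_classes=3, lr=0.5, steps=2000):
        self.W = np.zeros((dim, n_classes))
        self.b = np.zeros(n_classes)
        self.lr, self.steps = lr, steps

    def fit(self, X, y):
        Y = np.eye(self.W.shape[1])[y]  # one-hot labels
        for _ in range(self.steps):
            P = self._softmax(X @ self.W + self.b)
            self.W -= self.lr * (X.T @ (P - Y)) / len(X)
            self.b -= self.lr * (P - Y).mean(0)
        return self

    def predict(self, X):
        return (X @ self.W + self.b).argmax(1)

    @staticmethod
    def _softmax(z):
        z = z - z.max(1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(1, keepdims=True)
```

In practice the attention maps would come from a forward pass of the target LLM (e.g. with attention outputs enabled), and the classifier would be trained on labeled examples drawn from all three input categories.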

💡 Why This Paper Matters

This paper addresses a critical vulnerability in large language models regarding prompt injection attacks. The introduction of an alignment-aware detection system marks a significant step towards enhancing the security of these models, which are increasingly utilized in sensitive applications. By offering a systematic benchmark for evaluation and demonstrating the effectiveness of the proposed method, the work contributes valuable insights to the field.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant as it tackles prompt injection attacks, an evolving threat to LLMs, which are a cornerstone of modern AI systems. The development of a robust detection framework and benchmarking standards provides a foundation for future research in securing AI applications against such vulnerabilities. Moreover, understanding and mitigating these attack vectors is crucial for building trust in AI systems used in critical sectors like healthcare, finance, and governance.

📚 Read the Full Paper