AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks

Authors: Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong

Published: 2026-02-14

arXiv ID: 2602.13597v2

Added to Library: 2026-02-24 03:00 UTC

Red Teaming

📄 Abstract

Prompt injection attacks insert malicious instructions into an LLM's input to steer it toward an attacker-chosen task instead of the intended one. Existing detection defenses typically classify any input containing an instruction as malicious, misclassifying benign inputs whose instructions align with the intended task. In this work, we account for the instruction hierarchy and distinguish among three categories: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs. We introduce AlignSentinel, a three-class classifier that leverages features derived from an LLM's attention maps to categorize inputs accordingly. To support evaluation, we construct the first systematic benchmark containing inputs from all three categories. Experiments on both our benchmark and existing ones, where inputs with aligned instructions are largely absent, show that AlignSentinel accurately detects inputs with misaligned instructions and substantially outperforms baselines.

🔍 Key Points

  • Introduction of AlignSentinel, a novel three-class classifier that distinguishes misaligned-instruction, aligned-instruction, and non-instruction inputs in prompt injection scenarios.
  • Use of attention maps from Large Language Models (LLMs) as classification features, capturing where each input sits in the instruction hierarchy.
  • Development of a comprehensive benchmark covering all three input categories (misaligned, aligned, and non-instruction) to enable more faithful evaluation of detection methods.
  • Experimental demonstration that AlignSentinel significantly outperforms existing baselines at identifying misaligned instructions, achieving low false positive and false negative rates across various domains.
  • Evaluation of generalizability across different LLMs and benchmarks, showing robust performance in diverse contexts.
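The pipeline sketched in these key points (pool an LLM's attention maps into feature vectors, then run a three-class classifier over them) can be illustrated with a toy example. Everything below is an assumption for illustration only: the paper does not publish this feature design or classifier, so the specific pooling (how strongly input tokens attend back to system-prompt tokens versus to other input tokens) and the nearest-centroid classifier are stand-ins, not AlignSentinel's actual method.

```python
# Illustrative sketch of attention-feature-based three-class detection.
# The feature choice, pooling, and classifier are all hypothetical stand-ins,
# NOT the AlignSentinel implementation from the paper.
import statistics

LABELS = ("misaligned", "aligned", "non-instruction")

def attention_features(attn, n_system):
    """Pool one attention map (rows = query tokens, cols = key tokens) into a
    2-d feature vector: mean attention mass that input tokens place on the
    system-prompt (intended-task) tokens vs. on the other input tokens."""
    input_rows = attn[n_system:]
    to_system = [sum(row[:n_system]) for row in input_rows]
    to_input = [sum(row[n_system:]) for row in input_rows]
    return [statistics.mean(to_system), statistics.mean(to_input)]

def nearest_centroid(train, x):
    """Classify x by its nearest class centroid (a trivial stand-in for
    whatever classifier the paper actually trains on attention features)."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    centroids = {
        label: [statistics.mean(col) for col in zip(*feats)]
        for label, feats in train.items()
    }
    return min(centroids, key=lambda label: sqdist(centroids[label], x))

# Toy training features per class (made-up numbers for illustration).
TRAIN = {
    "misaligned": [[0.10, 0.90], [0.15, 0.85]],
    "aligned": [[0.50, 0.50], [0.55, 0.45]],
    "non-instruction": [[0.80, 0.20], [0.75, 0.25]],
}

# Toy 4-token attention map: first 2 tokens are the system prompt.
TOY_ATTN = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.05, 0.05, 0.60, 0.30],  # input tokens attending mostly to each other,
    [0.05, 0.15, 0.40, 0.40],  # a (hypothetical) signal of an injected task
]
```

A usage example: `nearest_centroid(TRAIN, attention_features(TOY_ATTN, 2))` returns `"misaligned"` on this toy data, since the input tokens attend mostly among themselves rather than to the system prompt. Whether that is the direction of the real signal is itself an assumption; the point is only the two-stage shape of the approach (attention pooling, then a lightweight classifier).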

💡 Why This Paper Matters

This paper advances the detection of prompt injection attacks by addressing the nuanced nature of instructions within inputs. The AlignSentinel framework improves detection accuracy while reducing false positives on benign inputs that legitimately contain instructions, a significant step toward securing Large Language Models against these attacks. Its attention-based features are what make this finer-grained distinction possible, ultimately contributing to the reliability of LLM applications.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers working to safeguard LLMs from manipulation, this paper is particularly relevant. It introduces a detection methodology that keeps pace with the increasingly sophisticated nature of prompt injection attacks; understanding and mitigating these vulnerabilities is essential for the safe deployment of AI systems in real-world applications. Moreover, the proposed benchmark enriches the research landscape by providing the tooling needed to evaluate detection methods, promoting further progress on AI security defenses.

📚 Read the Full Paper