← Back to Library

Sentinel: SOTA model to protect against prompt injections

Authors: Dror Ivry, Oran Nahum

Published: 2025-06-05

arXiv ID: 2506.05446v1

Added to Library: 2025-11-11 14:25 UTC

📄 Abstract

Large Language Models (LLMs) are increasingly powerful but remain vulnerable to prompt injection attacks, where malicious inputs cause the model to deviate from its intended instructions. This paper introduces Sentinel, a novel detection model, qualifire/prompt-injection-sentinel, based on the \answerdotai/ModernBERT-large architecture. By leveraging ModernBERT's advanced features and fine-tuning on an extensive and diverse dataset comprising a few open-source and private collections, Sentinel achieves state-of-the-art performance. This dataset amalgamates varied attack types, from role-playing and instruction hijacking to attempts to generate biased content, alongside a broad spectrum of benign instructions, with private datasets specifically targeting nuanced error correction and real-world misclassifications. On a comprehensive, unseen internal test set, Sentinel demonstrates an average accuracy of 0.987 and an F1-score of 0.980. Furthermore, when evaluated on public benchmarks, it consistently outperforms strong baselines like protectai/deberta-v3-base-prompt-injection-v2. This work details Sentinel's architecture, its meticulous dataset curation, its training methodology, and a thorough evaluation, highlighting its superior detection capabilities.

🔍 Key Points

  • Introduction of the Optimization-based Evaluation Toolkit (OET) for benchmarking prompt injection attacks and defenses against large language models (LLMs).
  • OET provides a modular framework that supports adaptive adversarial testing using both white-box and black-box optimization methods, allowing researchers to systematically evaluate and improve model robustness.
  • Extensive evaluations reveal a significant susceptibility of open-source LLMs to adversarial attacks, highlighting the limitations of existing defense mechanisms across various datasets and domains.
  • The toolkit enables customized implementation of new attack strategies, fostering exploration of diverse defense methods and their real-world applications.

💡 Why This Paper Matters

This paper presents a critical advancement in the evaluation of adversarial robustness in LLMs through the development of OET. By addressing the gaps in current evaluation frameworks, OET enables researchers to rigorously assess defenses against prompt injection attacks, ultimately contributing to a more secure application of LLMs in various sectors.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers as it provides a comprehensive evaluation framework that not only benchmarks existing defenses but also encourages the development of new ones. The insights gained from the OET toolkit can significantly impact the understanding of adversarial vulnerabilities in LLMs, making it a valuable resource for enhancing the security of AI systems.

📚 Read the Full Paper