When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Authors: Max Fomin

Published: 2026-02-15

arXiv ID: 2602.14161v1

Added to Library: 2026-02-17 03:00 UTC

Red Teaming

📄 Abstract

Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy, exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compositions rather than semantic content. We systematically compare production guardrails (PromptGuard 2, LlamaGuard) and LLM-as-judge approaches on our benchmark, finding all three fail on indirect attacks targeting agents (7-37% detection) and that PromptGuard 2 and LlamaGuard cannot evaluate agentic tool injection due to architectural limitations. Finally, we show that LODO-stable SAE features provide more reliable explanations for classifier decisions by filtering dataset artifacts. We release our evaluation framework at https://github.com/maxf-zn/prompt-mining to establish LODO as the appropriate protocol for prompt attack detection research.

🔍 Key Points

  • Introduction of Leave-One-Dataset-Out (LODO) evaluation to measure out-of-distribution generalization in prompt attack classifiers, revealing that standard same-source train-test splits overestimate performance by 8.4 AUC percentage points on aggregate, with per-dataset accuracy gaps of 1-25%.
  • Discovery that 28% of the top features in Sparse Auto-Encoder (SAE) models represent dataset-dependent shortcuts rather than generalizable attack patterns, highlighting issues with current benchmarking practices.
  • Comprehensive comparison of production guardrails (PromptGuard 2, LlamaGuard) and an LLM-as-judge approach against the proposed activation-based classifier, showing that all three existing methods struggle with indirect prompt injection attacks targeting agents (detection rates of only 7-37%).
  • Development of LODO-weighted explanations that filter out dataset-dependent artifacts, improving interpretability and trust in classification decisions by exposing the genuinely predictive features.
  • Argument that robust, shift-aware evaluation protocols are a prerequisite for deploying LLM-based agents in real-world applications, given the security impact of undetected prompt attacks.
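
The LODO protocol above can be sketched in a few lines: hold out one dataset entirely, train on the rest, and evaluate on the held-out one. The sketch below is illustrative only; the dataset names, features, and the plain logistic-regression classifier are synthetic stand-ins (the paper's benchmark uses SAE feature activations from 18 real attack datasets). It also demonstrates the fold-stability check the paper motivates: features whose coefficient sign flips between folds are candidate dataset-dependent shortcuts.

```python
# Minimal LODO sketch with synthetic stand-in datasets (not the paper's data).
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(shift, n=200, d=8):
    """One synthetic 'dataset': a genuinely predictive feature (column 0)
    plus a dataset-specific offset simulating distribution shift."""
    y = rng.integers(0, 2, n).astype(float)
    X = rng.normal(0, 1, (n, d))
    X[:, 0] += 2.0 * y   # real class signal
    X += shift           # dataset-specific artifact
    return X, y

datasets = {f"ds{i}": make_dataset(rng.normal(0, 0.5, 8)) for i in range(4)}

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain logistic regression via gradient descent (no external deps)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def lodo(datasets):
    """Train on all datasets but one; test on the held-out dataset."""
    accs, weights = {}, {}
    for held_out in datasets:
        Xtr = np.vstack([X for k, (X, _) in datasets.items() if k != held_out])
        ytr = np.concatenate([y for k, (_, y) in datasets.items() if k != held_out])
        w, b = fit_logreg(Xtr, ytr)
        Xte, yte = datasets[held_out]
        accs[held_out] = (((Xte @ w + b) > 0).astype(float) == yte).mean()
        weights[held_out] = w
    return accs, weights

accs, weights = lodo(datasets)
print({k: round(v, 2) for k, v in accs.items()})

# Fold-stability check: coefficients keeping the same sign in every LODO
# fold are "stable"; sign flips suggest a dataset-dependent shortcut.
W = np.vstack(list(weights.values()))
stable = np.all(np.sign(W) == np.sign(W[0]), axis=0)
print("stable features:", int(stable.sum()), "of", W.shape[1])
```

The same loop structure applies to real classifiers; with scikit-learn, `LeaveOneGroupOut` with dataset identity as the group label gives equivalent splits.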

💡 Why This Paper Matters

This paper is a critical contribution to AI security, specifically regarding the vulnerability of large language models (LLMs) to malicious prompt injection. By exposing the inadequacies of same-source train-test evaluation and proposing LODO as a replacement, it sets a new standard for assessing the security of AI systems against prompt attacks and underscores the need for detection methods that generalize reliably across diverse datasets.

🎯 Why It's Interesting for AI Security Researchers

The paper is particularly relevant to AI security researchers because it addresses growing concerns about the robustness of AI systems against adversarial attacks, specifically prompt injection. With LLMs increasingly integrated into agentic applications that process untrusted inputs, this research provides concrete evidence of where current defenses fail and methodologies for measuring real-world robustness. The findings challenge established evaluation practices and propose substantial improvements, making it essential reading for anyone working on AI safety, security, or rigorous model assessment.

📚 Read the Full Paper