← Back to Library

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

Authors: Anirudh Sekar, Mrinal Agarwal, Rachel Sharma, Akitsugu Tanaka, Jasmine Zhang, Arjun Damerla, Kevin Zhu

Published: 2026-01-18

arXiv ID: 2601.12359v1

Added to Library: 2026-01-21 03:01 UTC

Red Teaming Safety

πŸ“„ Abstract

Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of <3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.

πŸ” Key Points

  • Introduction of Zero-Shot Embedding Drift Detection (ZEDD) as a lightweight and efficient mechanism for detecting prompt injection attacks in Large Language Models (LLMs).
  • ZEDD measures semantic drift in embedding space without requiring model retraining, internal access, or prior knowledge of attack patterns, making it highly adaptable across different LLM architectures.
  • Extensive evaluation showed ZEDD achieves over 93% accuracy with a false positive rate under 3% for detecting various types of prompt injections, outperforming existing methods.
  • Development of a comprehensive LLMail-Inject dataset to benchmark prompt injection detection across five categories, demonstrating ZEDD's effectiveness against real-world attack strategies.
  • ZEDD combines Gaussian Mixture Models (GMM) and Kernel Density Estimation (KDE) for effective drift detection, enhancing both precision and recall in a low-latency operational framework.

πŸ’‘ Why This Paper Matters

This paper presents a significant advancement in securing LLMs against growing vulnerabilities associated with prompt injection attacks. By providing a robust, efficient, and practical detection framework, ZEDD enhances the resilience of LLM applications, making them safer for deployment in sensitive contexts. Its high accuracy and ease of integration position it as a pivotal tool in the evolving landscape of AI security.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers as it tackles a pressing threatβ€”prompt injection attacks on LLMs, which are increasingly exploited in real-world scenarios. ZEDD offers a novel detection approach that is model-agnostic, scalable, and efficient, addressing a significant gap in existing defenses. Researchers focusing on the security of AI systems will find the findings of this paper particularly insightful, as they highlight both the vulnerabilities of LLMs and potential solutions for enhancing their robustness against adversarial manipulation.

πŸ“š Read the Full Paper