Soft Instruction De-escalation Defense

Authors: Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov

Published: 2025-10-24

arXiv ID: 2510.21057v2

Added to Library: 2026-01-21 03:01 UTC

πŸ“„ Abstract

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control), a simple yet effective iterative prompt-sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative, instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; a strong adversary can still achieve a 15% attack success rate (ASR) by embedding non-imperative workflows. SIC nonetheless raises the bar.
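The inspect-rewrite-re-evaluate loop from the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the paper uses an LLM to inspect and rewrite untrusted data, whereas here a toy regex detector and a masking rewriter stand in for those components, and all function names (`contains_instructions`, `sanitize`, `sic_loop`) are hypothetical.

```python
import re

# Toy stand-in for the LLM-based inspector: flag imperative,
# instruction-like phrases in untrusted data.
IMPERATIVE_PATTERNS = [
    re.compile(r"(?i)\bignore (all )?previous instructions\b"),
    re.compile(r"(?i)\byou must\b"),
    re.compile(r"(?i)\bforward .* to\b"),
]


def contains_instructions(text: str) -> bool:
    """Return True if any instruction-like pattern is present."""
    return any(p.search(text) for p in IMPERATIVE_PATTERNS)


def sanitize(text: str) -> str:
    """Toy rewriter: mask detected instruction-like spans."""
    for p in IMPERATIVE_PATTERNS:
        text = p.sub("[REDACTED]", text)
    return text


def sic_loop(data: str, max_iters: int = 3) -> str:
    """Iteratively inspect and rewrite untrusted data.

    Mirrors the loop described in the abstract: re-evaluate after
    each rewrite, and halt (here: raise) if instruction-like content
    persists past the iteration limit.
    """
    for _ in range(max_iters):
        if not contains_instructions(data):
            return data  # input is clean; hand it to the agent
        data = sanitize(data)
    if contains_instructions(data):
        raise RuntimeError("Halting: instruction-like content persists")
    return data
```

A clean input passes through unchanged, while an input containing "ignore previous instructions" comes back with that span masked; an input the rewriter repeatedly fails to clean triggers the halt, which is the availability-for-security trade the abstract describes.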

πŸ” Key Points

  • Introduction of SIC (Soft Instruction Control), an iterative prompt-sanitization loop for tool-augmented LLM agents that inspects untrusted incoming data for instructions that could compromise agent behavior.
  • When injected content is detected, it is rewritten, masked, or removed, and the result is re-evaluated; the loop repeats until the input is clean or a maximum iteration limit is reached.
  • If imperative, instruction-like content persists past the iteration limit, the agent halts rather than act on potentially compromised input, trading availability for security.
  • Allowing multiple passes acknowledges that any individual rewrite may fail, while letting later iterations catch and correct injections that earlier ones missed.
  • Worst-case analysis shows SIC is not infallible: a strong adversary can still achieve a 15% attack success rate (ASR) by embedding non-imperative workflows, though the defense meaningfully raises the bar.

πŸ’‘ Why This Paper Matters

This paper addresses a central weakness of agentic LLM deployments: prompt injection delivered through the untrusted external data that tool-augmented agents must process. SIC offers a simple, practical defense layer, an iterative sanitize-and-re-check loop with a fail-closed halting rule, and pairs it with an honest worst-case analysis showing a residual 15% attack success rate against strong adversaries, giving deployers a realistic picture of both the protection gained and the risk that remains.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant for AI security researchers because it tackles a pressing threat: prompt injection against tool-augmented LLM agents operating on untrusted data. SIC's iterative sanitization loop is a simple, deployable mitigation, and the accompanying worst-case analysis is equally instructive: by showing that adversaries embedding non-imperative workflows can still achieve a 15% ASR, the paper delineates where rewrite-based defenses break down and motivates further work on detecting injections that do not look like imperative instructions.

πŸ“š Read the Full Paper