CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

📄 Abstract

The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.

🔍 Key Points

Systematic evaluation of indirect prompt injection attacks on large language models (LLMs), highlighting vulnerabilities across various models and implementations.
Identification of key factors influencing model susceptibility including size and architecture, revealing persistent weaknesses even in advanced models.
Development of novel obfuscation techniques utilized in the attack scenarios, allowing adversaries to exploit models through hidden instructions embedded in seemingly benign inputs.
Empirical evidence showcasing varying degrees of resilience among different LLMs, with some models exhibiting alarming rates of successful attacks despite advanced security mechanisms.
Recommendations for improving LLM defenses, including a centralized database of attack vectors and the integration of security into model training processes.

💡 Why This Paper Matters

This paper is critically relevant in addressing the emerging threats posed by indirect prompt injection attacks on LLMs, underscoring the necessity for enhanced security frameworks in AI applications. The findings not only highlight significant vulnerabilities in existing models but also provide a structured approach for future developments in AI security protocols, making it a pivotal resource for safeguarding corporate data against unprecedented threats.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper offers crucial insights into the evolving landscape of vulnerabilities associated with LLMs, particularly in the context of their integration with external data sources. The empirical data on attack success rates, coupled with a comprehensive analysis of obfuscation techniques, serves as a foundational study for understanding and mitigating security threats in generative AI systems. Furthermore, the proposed frameworks and defensive strategies can guide researchers in developing robust countermeasures against increasingly sophisticated adversarial tactics.

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper