← Back to Library

Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection

Authors: Darren Cheng, Wen-Kwang Tsao

Published: 2026-03-13

arXiv ID: 2603.13424v1

Added to Library: 2026-03-17 02:01 UTC

📄 Abstract

Prompt injection remains one of the most practical attack vectors against LLM-integrated applications. We replicate the Microsoft LLMail-Inject benchmark (Greshake et al., 2024) against current generation models running inside OpenClaw, an open source multitool agent platform. Our proposed defense combines two mechanisms: agent isolation, implemented as a privilege separated two-agent pipeline with tool partitioning, and JSON formatting, which produces structured output that strips persuasive framing before the action agent processes it. We run four experiments on the same 649 attacks that succeeded against our single-agent baseline. The full pipeline achieves 0 percent attack success rate (ASR) on the evaluated benchmark. Agent isolation alone achieves 0.31 percent ASR, approximately 323 times lower than the baseline. JSON formatting alone achieves 14.18 percent ASR, about 7.1 times lower. Our ablation study confirms that agent isolation is the dominant mechanism. JSON formatting provides additional hardening but is not sufficient on its own. The defense is structural: the action agent never receives raw injection content regardless of model behavior on any individual input.

🔍 Key Points

  • Introduction of SWhisper, a framework for covert prompt-based attacks against speech-driven LLMs using commodity hardware and inaudible near-ultrasonic audio.
  • Demonstration of robust, inaudible delivery of arbitrary target audio with high fidelity even under realistic deployment conditions, highlighting its effectiveness against commercial and open-source models.
  • Development of an optimization-based jailbreaking prompt that balances intelligibility, brevity, and transferability, which is critical for practical applications.
  • Experimental validation revealing strong black-box effectiveness, achieving non-refusal and specific-convincing scores close to 0.95, and demonstrating perceptually indistinguishable injected audio.
  • Discussion of broader security implications, revealing additional attack vectors enabled by the covert acoustic channel for a range of malicious activities beyond jailbreaking.

💡 Why This Paper Matters

The paper presents significant advancements in the field of AI security by introducing SWhisper, a practical framework that exploits the vulnerabilities of speech-driven language models. By enabling covert prompt injections through inaudible audio, it highlights critical security risks that can undermine the integrity and safety of AI systems. The experimental results validate its effectiveness and invisibility, raising awareness about potential attack vectors that are not only innovative but also practically deployable.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers as it uncovers a novel attack vector in speech-driven systems, emphasizing the challenges of ensuring secure AI interactions in real-world environments. The sophisticated methods and comprehensive analysis of SWhisper contribute significantly to the understanding of vulnerabilities in LLMs and encourage the development of countermeasures to enhance the safety of AI applications.

📚 Read the Full Paper