ClawSafety: "Safe" LLMs, Unsafe Agents

Authors: Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge

Published: 2026-04-01

arXiv ID: 2604.01438v2

Added to Library: 2026-04-07 02:02 UTC

Red Teaming

📄 Abstract

Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text-level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce ClawSafety, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails from trusted senders, and web pages. We evaluate five frontier LLMs as agent backbones, running 2,520 sandboxed trials across all configurations. Attack success rates (ASR) range from 40% to 75% across models and vary sharply by injection vector, with skill instructions (highest trust) consistently more dangerous than email or web content. Action-trace analysis reveals that the strongest model maintains hard boundaries against credential forwarding and destructive actions, while weaker models permit both. Cross-scaffold experiments on three agent frameworks further demonstrate that safety is not determined by the backbone model alone but depends on the full deployment stack, calling for safety evaluation that treats model and framework as joint variables. Code and data will be available at: https://weibowen555.github.io/ClawSafety/.
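
To make the benchmark structure concrete, the sketch below shows one way a ClawSafety-style test scenario could be represented: a record combining the three taxonomy dimensions (harm domain, attack vector, harmful action type) with the injection channel and payload. This is a minimal illustration, not the authors' released code (which is forthcoming at the project page); all field and enum names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class AttackVector(Enum):
    SKILL_FILE = "skill_file"  # workspace skill/instruction files (highest trust)
    EMAIL = "email"            # messages from trusted senders
    WEB_PAGE = "web_page"      # content fetched while browsing


@dataclass
class Scenario:
    scenario_id: str
    harm_domain: str           # e.g. "finance", "healthcare", "devops"
    attack_vector: AttackVector
    harmful_action: str        # e.g. "credential_exfiltration", "file_destruction"
    workspace_path: str        # sandboxed professional workspace the agent works in
    injected_payload: str      # adversarial instruction embedded in the channel above
```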

🔍 Key Points

  • Introduction of ClawSafety, a benchmark with 120 adversarial test scenarios for evaluating AI agent safety in real-world professional contexts.
  • Analysis of attack success rates (ASR) across five frontier LLMs, showing that safety varies sharply with both the backbone model and the scaffolding framework (a minimal aggregation sketch follows this list).
  • Identification of a trust-level gradient across attack vectors (skill file injection, email injection, web content), with the finding that operationally specific malicious instructions succeed more often than instructions relying on apparent authority alone.
  • Cross-scaffold analysis indicating that safety is a property of the complete deployment stack (model and framework), rather than the model alone, promoting comprehensive safety assessments.
  • Exploration of qualitative case studies that reveal specific mechanisms and vulnerabilities not captured by aggregate metrics.
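
As a rough illustration of the kind of aggregation behind these per-model, per-vector ASR comparisons, the helper below groups sandboxed trial outcomes by backbone model and injection vector. It is a sketch under assumed inputs, not the paper's evaluation harness; the trial keys 'model', 'vector', and 'attack_succeeded' are hypothetical.

```python
from collections import defaultdict


def asr_by_model_and_vector(trials):
    """Attack success rate grouped by (model, attack vector).

    `trials` is an iterable of dicts with hypothetical keys
    'model', 'vector', and 'attack_succeeded' (bool), one per sandboxed run.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for t in trials:
        key = (t["model"], t["vector"])
        counts[key][0] += int(t["attack_succeeded"])
        counts[key][1] += 1
    return {key: succ / total for key, (succ, total) in counts.items()}
```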

💡 Why This Paper Matters

The paper introduces ClawSafety, a crucial advancement in benchmarking the safety of personal AI agents in high-stakes environments, addressing the urgent need for rigorous evaluation frameworks that account for both the models and their deployment contexts. By illustrating how adversarial threats manifest in realistic scenarios, this work lays the groundwork for safer AI applications in sensitive domains like finance and healthcare, ultimately contributing to more robust AI safety standards.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it provides a novel framework for evaluating the safety of AI systems in realistic scenarios, identifying vulnerabilities specific to personal AI agents. Its empirical findings on attack vectors and model vulnerabilities not only deepen the understanding of LLM safety but also inform the development of more resilient AI systems by outlining key mechanisms that can be exploited. Furthermore, the introduction of a detailed threat taxonomy and comprehensive evaluation methodology presents a valuable tool for ongoing safety research and development.

📚 Read the Full Paper