ClawSafety: "Safe" LLMs, Unsafe Agents

Authors: Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge

Published: 2026-04-01

arXiv ID: 2604.01438v1

Added to Library: 2026-04-03 02:03 UTC

Red Teaming

📄 Abstract

Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text-level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce CLAWSAFETY, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails from trusted senders, and web pages. We evaluate five frontier LLMs as agent backbones, running 2,520 sandboxed trials across all configurations. Attack success rates (ASR) range from 40% to 75% across models and vary sharply by injection vector, with skill instructions (highest trust) consistently more dangerous than email or web content. Action-trace analysis reveals that the strongest model maintains hard boundaries against credential forwarding and destructive actions, while weaker models permit both. Cross-scaffold experiments on three agent frameworks further demonstrate that safety is not determined by the backbone model alone but depends on the full deployment stack, calling for safety evaluation that treats model and framework as joint variables.
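
To make the benchmark's three-dimensional structure concrete, here is a minimal Python sketch of how a CLAWSAFETY-style scenario and a sandboxed trial outcome might be represented, along with the ASR metric. All field and type names are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a CLAWSAFETY-style scenario and trial record.
# Names and fields are illustrative assumptions, not taken from the paper.
from dataclasses import dataclass
from enum import Enum


class InjectionVector(Enum):
    SKILL_FILE = "workspace_skill_file"   # highest-trust channel per the abstract
    EMAIL = "trusted_sender_email"
    WEB_PAGE = "web_page"


@dataclass
class Scenario:
    harm_domain: str            # e.g. "finance", "healthcare"
    attack_vector: InjectionVector
    harmful_action: str         # e.g. "credential_forwarding", "file_destruction"


@dataclass
class TrialResult:
    scenario: Scenario
    model: str                  # LLM used as the agent backbone
    scaffold: str               # agent framework forming the deployment stack
    attack_succeeded: bool      # did the agent carry out the injected harmful action?


def attack_success_rate(trials: list[TrialResult]) -> float:
    """Fraction of sandboxed trials in which the injected attack succeeded."""
    if not trials:
        return 0.0
    return sum(t.attack_succeeded for t in trials) / len(trials)
```
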

🔍 Key Points

  • Introduction of ClawSafety, a comprehensive benchmark with 120 adversarial test scenarios focusing on high-privilege professional workspaces, covering diverse domains such as software engineering, finance, healthcare, law, and DevOps.
  • Demonstration that existing safety evaluations are insufficient because they overlook the interplay between LLMs and agent frameworks; the benchmark treats both components as joint variables, motivating new methodologies for assessing agent safety.
  • Evaluation of five frontier LLMs across 2,520 sandboxed trials showed attack success rates ranging from 40% to 75% depending on the model and attack vector, indicating distinct safety profiles and trust-level gradients.
  • Discovery of critical defense boundaries within LLMs: the strongest model maintained a hard boundary against credential forwarding and destructive actions, while weaker models did not, revealing substantial safety variation among models.
  • Finding that safety is not determined by the backbone model alone but depends on the entire deployment stack, calling for safety evaluations that treat model and framework as joint variables (see the sketch below).
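
As a rough illustration of the "joint variables" point, the snippet below (continuing the hypothetical types from the earlier sketch) aggregates ASR over (model, scaffold) pairs rather than over models alone. The grouping keys are assumptions for illustration, not the paper's reporting format.

```python
# Continuing the hypothetical TrialResult / attack_success_rate sketch above:
# aggregate ASR per (model, scaffold) pair, since the paper argues safety
# depends on the full deployment stack, not the backbone model alone.
from collections import defaultdict


def asr_by_model_and_scaffold(trials: list[TrialResult]) -> dict[tuple[str, str], float]:
    buckets: dict[tuple[str, str], list[TrialResult]] = defaultdict(list)
    for t in trials:
        buckets[(t.model, t.scaffold)].append(t)
    return {key: attack_success_rate(group) for key, group in buckets.items()}
```
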

💡 Why This Paper Matters

The paper is pivotal in addressing the emerging threats posed by personal AI agents that operate with elevated privileges. By establishing a detailed benchmark like ClawSafety, it not only highlights the inadequacies of current safety evaluations but also provides a structured methodology that could deepen our understanding of agent vulnerabilities and guide the development of safer AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly significant for AI security researchers because it delves into the nuances of safety evaluation in LLM-driven agents, where traditional methodologies fall short. The insights gained from ClawSafety's comprehensive framework offer valuable contributions to understanding attack vectors, safety boundaries, and the interaction between different components of AI systems, ultimately driving advancements in the development of secure, reliable personal AI agents.

📚 Read the Full Paper