ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Authors: Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

Published: 2026-04-06

arXiv ID: 2604.05172v2

Added to Library: 2026-04-09 02:01 UTC

Safety

📄 Abstract

Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification. We release the trajectories and dataset at https://clawsbench.com.
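
To make the snapshot/restore idea concrete, here is a minimal sketch of a stateful mock service with deterministic checkpointing. All names in it (MockGmail, snapshot, restore, the record fields) are illustrative assumptions, not the actual ClawsBench API.

```python
# Minimal sketch of a stateful mock service with deterministic
# snapshot/restore, in the spirit of the environments described above.
# All names here (MockGmail, snapshot, restore) are illustrative
# assumptions, not the actual ClawsBench API.
import json
from dataclasses import dataclass, field


@dataclass
class MockGmail:
    """Toy stand-in for one of the five stateful mock services."""
    inbox: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def send(self, to: str, subject: str, body: str) -> None:
        # Sending mutates state: the kind of side effect that makes
        # benchmarking against live services risky.
        self.sent.append({"to": to, "subject": subject, "body": body})

    def snapshot(self) -> str:
        # Serialize the full state so a task can be replayed deterministically.
        return json.dumps({"inbox": self.inbox, "sent": self.sent}, sort_keys=True)

    def restore(self, snap: str) -> None:
        # Roll the service back to an exact prior state.
        state = json.loads(snap)
        self.inbox, self.sent = state["inbox"], state["sent"]


service = MockGmail(inbox=[{"from": "alice@example.com", "subject": "Q3 report"}])
snap = service.snapshot()            # checkpoint before the agent acts
service.send("bob@example.com", "Fwd: Q3 report", "See attached.")
service.restore(snap)                # undo all agent side effects
assert service.sent == []            # state matches the checkpoint exactly
```

Deterministic serialization (note sort_keys=True) is what lets every agent run start from an identical environment, so results are comparable across the 33 experimental conditions.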

🔍 Key Points

  • Introduction of ClawsBench, a benchmark for evaluating LLM agents in realistic productivity settings, built on five high-fidelity mock services with full state management and deterministic snapshot/restore.
  • Structured scoring of both performance (Task Success Rate) and safety (Unsafe Action Rate), making the capability-safety trade-off directly measurable (see the sketch after this list).
  • Identification of eight recurring patterns of unsafe agent behavior, helping practitioners recognize and prevent these failure modes in LLM deployments.
  • Demonstration that agent scaffolding (domain skills with progressive disclosure and a meta prompt) significantly improves task success, while exposing a trade-off with safety.
  • Experimental finding that capability does not directly correlate with safety: some models pair high task success with elevated unsafe action rates.
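
As a hedged illustration of the two scores above, the sketch below computes a Task Success Rate and an Unsafe Action Rate from per-run records. The record layout (a "success" flag plus per-action "unsafe" labels) is an assumed schema for illustration, not the format of the released trajectories.

```python
# Hedged sketch of the two headline metrics. The per-run record layout
# ("success" flag plus per-action "unsafe" labels) is an assumed schema,
# not the schema of the released ClawsBench trajectories.
from typing import Iterable


def task_success_rate(runs: Iterable[dict]) -> float:
    """Fraction of tasks whose final environment state passes the goal check."""
    runs = list(runs)
    return sum(r["success"] for r in runs) / len(runs)


def unsafe_action_rate(runs: Iterable[dict]) -> float:
    """Fraction of all agent actions flagged unsafe (e.g., irreversible
    deletes, sandbox escalation, silent contract modification)."""
    actions = [a for r in runs for a in r["actions"]]
    return sum(a["unsafe"] for a in actions) / len(actions)


runs = [
    {"success": True,  "actions": [{"unsafe": False}, {"unsafe": True}]},
    {"success": False, "actions": [{"unsafe": False}]},
]
print(task_success_rate(runs))   # 0.5
print(unsafe_action_rate(runs))  # 0.333...
```

Keeping the two rates separate, rather than folding them into one score, is what lets the paper show that the top models cluster on task success while diverging widely on safety.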

💡 Why This Paper Matters

ClawsBench addresses the critical need to deploy LLM agents safely and effectively in productivity workflows. By providing a rigorous, reproducible evaluation framework, the paper advances our understanding of agent behavior in complex, stateful environments, particularly in safety-critical scenarios. The benchmark lays a foundation for ongoing research and development in LLM agent safety, which grows more urgent as reliance on these systems continues to rise.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers because it directly tackles the intersection of capability and safety in LLM agents, in environments where unsafe actions carry real consequences. The catalogued unsafe-behavior patterns and the systematic evaluation framework give researchers a concrete basis for assessing agent safety and building more resilient AI systems. Insights from ClawsBench can also inform best practices in deployment, regulation, and the design of safety protocols for high-stakes applications.

📚 Read the Full Paper: https://arxiv.org/abs/2604.05172v2