
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Authors: Tianyu Chen, Dongrui Liu, Xia Hu, Jingyi Yu, Wenjie Wang

Published: 2026-02-16

arXiv ID: 2602.14364v1

Added to Library: 2026-02-17 04:01 UTC

📄 Abstract

Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk dimensions. Our test suite samples and lightly adapts scenarios from prior agent-safety benchmarks (including ATBench and LPS-Bench) and supplements them with hand-designed cases tailored to Clawdbot's tool surface. We log complete interaction trajectories (messages, actions, tool-call arguments/outputs) and assess safety using both an automated trajectory judge (AgentDoG-Qwen3-4B) and human review. Across 34 canonical cases, we find a non-uniform safety profile: performance is generally consistent on reliability-focused tasks, while most failures arise under underspecified intent, open-ended goals, or benign-seeming jailbreak prompts, where minor misinterpretations can escalate into higher-impact tool actions. We supplement the overall results with representative case studies, summarize their commonalities, and analyze the security vulnerabilities and typical failure modes that Clawdbot is prone to trigger in practice.

🔍 Key Points

  • A trajectory-centric safety evaluation of Clawdbot (OpenClaw), a self-hosted, tool-using personal AI agent whose action space spans local execution and web-mediated workflows.
  • A test suite covering six risk dimensions, built by sampling and lightly adapting scenarios from prior agent-safety benchmarks (ATBench, LPS-Bench) and adding hand-designed cases tailored to Clawdbot's tool surface.
  • Complete interaction trajectories (messages, actions, tool-call arguments/outputs) are logged and assessed with both an automated trajectory judge (AgentDoG-Qwen3-4B) and human review.
  • Across 34 canonical cases, the safety profile is non-uniform: reliability-focused tasks are handled consistently, while most failures arise under underspecified intent, open-ended goals, or benign-seeming jailbreak prompts.
  • Representative case studies identify common security vulnerabilities and typical failure modes, showing how minor misinterpretations can escalate into higher-impact tool actions.

💡 Why This Paper Matters

This paper delivers a concrete, trajectory-level safety audit of a self-hosted, tool-using personal AI agent of the kind now entering everyday use. Because failures concentrate under ambiguity and adversarial steering rather than on reliability-focused tasks, the results direct auditors and developers toward the interventions that matter most in practice: intent clarification before action and guardrails around high-impact tool calls.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers because it offers a reusable evaluation recipe for agentic systems: sample and adapt scenarios from existing agent-safety benchmarks, log full interaction trajectories, and combine an automated trajectory judge with human review. The documented failure modes and case studies provide concrete targets for defenses against underspecified intent, open-ended goals, and benign-seeming jailbreak prompts in tool-using agents.
