SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants

Authors: Haoyu Wang, Zibo Xiao, Yedi Zhang, Christopher M. Poskitt, Jun Sun

Published: 2026-03-28

arXiv ID: 2603.28807v1

Added to Library: 2026-04-01 02:03 UTC

📄 Abstract

LLM-based multi-agent systems (MASs) are transforming personal productivity by autonomously executing complex, cross-platform tasks. Frameworks such as OpenClaw demonstrate the potential of locally deployed agents integrated with personal data and services, but this autonomy introduces significant safety and security risks. Unintended actions from LLM reasoning failures can cause irreversible harm, while prompt injection attacks may exfiltrate credentials or compromise the system. Our analysis shows that 36.4% of OpenClaw's built-in skills pose high or critical risks. Existing approaches, including static guardrails and LLM-as-a-Judge, lack reliable real-time enforcement and consistent authority in MAS settings. To address this, we propose SafeClaw-R, a framework that enforces safety as a system-level invariant over the execution graph by ensuring that actions are mediated prior to execution, and systematically augments skills with safe counterparts. We evaluate SafeClaw-R across three representative domains: productivity platforms, third-party skill ecosystems, and code execution environments. SafeClaw-R achieves 95.2% accuracy in Google Workspace scenarios, significantly outperforming regex baselines (61.6%), detects 97.8% of malicious third-party skill patterns, and achieves 100% detection accuracy in our adversarial code execution benchmark. These results demonstrate that SafeClaw-R enables practical runtime enforcement for autonomous MASs.
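The core mechanism, mediating every action prior to execution, can be made concrete with a short sketch. The paper does not publish SafeClaw-R's API, so the names below (`Mediator`, `Action`, `Verdict`, and the example policy) are illustrative assumptions rather than the framework's actual interfaces; the point is only the structural invariant: agents never invoke skills directly, and every action passes through a single policy choke point before it runs.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Callable


class Verdict(Enum):
    ALLOW = auto()
    DENY = auto()
    CONFIRM = auto()  # potentially irreversible: ask the user first


@dataclass(frozen=True)
class Action:
    agent: str           # which agent in the MAS requested this
    skill: str           # e.g. "gmail.send", "shell.exec" (illustrative names)
    args: dict[str, Any]


# A policy runs *before* execution; DENY or a declined CONFIRM blocks
# the action. Real policies would cover exfiltration, credential access,
# irreversibility, and so on.
Policy = Callable[[Action], Verdict]


def deny_credential_exfiltration(action: Action) -> Verdict:
    """Toy policy: block outbound sends whose arguments mention secrets."""
    sensitive = ("password", "token", "api_key")
    if action.skill.endswith(".send") and any(
        k in str(action.args).lower() for k in sensitive
    ):
        return Verdict.DENY
    return Verdict.ALLOW


class Mediator:
    """Single choke point: every node in the execution graph must call
    `execute`, so mediation-before-execution holds as a system-level
    invariant rather than a best-effort filter."""

    def __init__(self, policies: list[Policy]):
        self.policies = policies

    def execute(self, action: Action, run: Callable[[Action], Any]) -> Any:
        for policy in self.policies:
            verdict = policy(action)
            if verdict is Verdict.DENY:
                raise PermissionError(f"blocked: {action.skill}")
            if verdict is Verdict.CONFIRM:
                if input(f"Allow {action.skill}? [y/N] ").lower() != "y":
                    raise PermissionError(f"user declined: {action.skill}")
        return run(action)  # reached only if every policy allowed it
```

Making `execute` the only path to skill invocation is what turns mediation into an invariant: a confused or prompt-injected agent has no alternative entry point through which to act.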

🔍 Key Points

  • A risk analysis of OpenClaw's built-in skills finds that 36.4% pose high or critical risks, motivating mediation of agent actions before they execute rather than after the fact.
  • Existing defenses, including static guardrails and LLM-as-a-Judge, lack reliable real-time enforcement and consistent authority when multiple agents act within a MAS.
  • SafeClaw-R enforces safety as a system-level invariant over the execution graph: every action is mediated prior to execution, so no individual agent can bypass the policy layer.
  • Skills are systematically augmented with safe counterparts that preserve functionality while routing risky operations through the mediation layer (a minimal wrapper sketch follows this list).
  • Evaluation spans three representative domains: 95.2% accuracy on Google Workspace scenarios (vs. 61.6% for regex baselines), 97.8% detection of malicious third-party skill patterns, and 100% detection accuracy on an adversarial code-execution benchmark.
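The safe-counterpart idea from the list above can be sketched as a wrapper that re-registers each skill behind the mediator. Again, `safe_counterpart` and the registry usage are hypothetical illustrations (reusing the `Action` and `Mediator` sketch from the abstract section), not SafeClaw-R's actual augmentation mechanism:

```python
from functools import wraps


def safe_counterpart(skill_fn, mediator, skill_name):
    """Wrap an existing skill so every invocation is routed through the
    mediator. Registering the wrapper in place of the raw skill is one
    way to realize 'augmenting skills with safe counterparts'."""
    @wraps(skill_fn)
    def wrapped(**kwargs):
        action = Action(agent="assistant", skill=skill_name, args=kwargs)
        return mediator.execute(action, lambda a: skill_fn(**a.args))
    return wrapped


# Hypothetical usage: swap the raw skill for its safe counterpart.
# registry["gmail.send"] = safe_counterpart(raw_send, mediator, "gmail.send")
```

Because the wrapper preserves the original signature, agents keep calling skills as before; only the execution path changes.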

💡 Why This Paper Matters

This paper matters because locally deployed multi-agent assistants such as OpenClaw act directly on personal data, credentials, and services, so a single reasoning failure or injected prompt can cause irreversible harm. By showing that over a third of OpenClaw's built-in skills carry high or critical risk, and by demonstrating that safety can be enforced as a runtime invariant rather than a static filter, SafeClaw-R offers a practical path to deploying autonomous assistants without giving up enforceable guarantees. The wide margin over regex baselines (95.2% vs. 61.6%) suggests that pre-execution mediation, not pattern matching, is the right abstraction for this problem.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this work interesting because it treats MAS safety as a systems problem rather than a model-alignment problem: the enforcement point is the execution graph, not the LLM's output. The threat model, covering prompt injection, credential exfiltration, and malicious third-party skills, reflects attack classes the abstract identifies as realistic for agent frameworks, and the three evaluation domains (productivity platforms, skill ecosystems, code execution) provide concrete settings against which future defenses can be compared. As agents gain broader autonomy, architectures that guarantee mediation before execution are likely to become a baseline requirement, making this framework a useful reference design.

📚 Read the Full Paper