
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

Authors: Yinghan Hou, Zongyou Yang

Published: 2026-04-08

arXiv ID: 2604.06550v1

Added to Library: 2026-04-09 02:01 UTC

📄 Abstract

OpenClaw's ClawHub marketplace hosts over 13,000 community-contributed agent skills, and between 13% and 26% of them contain security vulnerabilities according to recent audits. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language instructions in SKILL.md files where prompt injection and social engineering attacks hide. Neither approach handles both modalities. SkillSieve is a three-layer detection framework that applies progressively deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through an XGBoost-based feature scorer, filtering roughly 86% of benign skills in under 40ms on average at zero API cost. Layer 2 sends suspicious skills to an LLM, but instead of asking one broad question, it splits the analysis into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output. Layer 3 puts high-risk skills before a jury of three different LLMs that vote independently and, if they disagree, debate before reaching a verdict. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the full pipeline on a $40 ARM single-board computer. On a 400-skill labeled benchmark, SkillSieve achieves 0.800 F1, outperforming ClawVet's 0.421, at an average cost of $0.006 per skill. Code, data, and benchmark are open-sourced.
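The abstract gives only the shape of Layer 1, so the following is a minimal Python sketch under stated assumptions: the regex patterns, feature set, toy training data, and `benign_cutoff` threshold here are hypothetical stand-ins, not the paper's actual configuration.

```python
import ast
import re

import numpy as np
from xgboost import XGBClassifier

# Hypothetical cheap static signals; the paper's actual Layer-1 feature set
# and model hyperparameters are not specified in this summary.
SUSPICIOUS_PATTERNS = [
    r"eval\s*\(", r"exec\s*\(", r"base64\.b64decode",
    r"subprocess", r"curl\s+https?://", r"chmod\s+\+x",
]

def extract_features(skill_md: str, script_src: str) -> np.ndarray:
    """Regex, AST, and metadata-style features for one skill."""
    regex_hits = sum(bool(re.search(p, script_src)) for p in SUSPICIOUS_PATTERNS)
    try:
        tree = ast.parse(script_src)
        n_calls = sum(isinstance(n, ast.Call) for n in ast.walk(tree))
        n_imports = sum(isinstance(n, (ast.Import, ast.ImportFrom)) for n in ast.walk(tree))
    except SyntaxError:
        n_calls = n_imports = -1  # unparseable code is itself a weak signal
    return np.array([regex_hits, n_calls, n_imports,
                     len(skill_md), len(script_src)], dtype=float)

# Stand-in training data; in practice the scorer would be fit on audited, labeled skills.
rng = np.random.default_rng(0)
X_train = rng.random((64, 5))
y_train = rng.integers(0, 2, 64)
scorer = XGBClassifier(n_estimators=50, max_depth=4).fit(X_train, y_train)

def layer1_route(skill_md: str, script_src: str, benign_cutoff: float = 0.15) -> str:
    """Filter clearly benign skills locally; escalate the rest to Layer 2."""
    x = extract_features(skill_md, script_src).reshape(1, -1)
    p_malicious = scorer.predict_proba(x)[0, 1]
    return "benign" if p_malicious < benign_cutoff else "escalate"
```

The design point this illustrates: everything in Layer 1 runs locally on cheap features, which is what makes filtering ~86% of skills at zero API cost and sub-40ms latency plausible even on low-end hardware.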

🔍 Key Points

  • A hierarchical three-layer triage design that applies progressively deeper (and costlier) analysis only to skills that earlier, cheaper layers flag as suspicious, keeping the average cost to roughly $0.006 per skill.
  • Layer 1: an XGBoost-based feature scorer over regex, AST, and metadata signals that filters roughly 86% of benign skills in under 40ms on average, at zero API cost.
  • Layer 2: decomposition of LLM review into four parallel sub-tasks (intent alignment, permission justification, covert behavior detection, cross-file consistency), each with its own prompt and structured output, covering the natural-language SKILL.md attack surface that static analyzers miss.
  • Layer 3: a jury of three different LLMs that vote independently on high-risk skills and, on disagreement, debate before reaching a verdict (see the sketch after this list).
  • Evaluation on 49,592 real ClawHub skills plus adversarial samples across five evasion techniques, with the full pipeline running on a $40 ARM single-board computer; on a 400-skill labeled benchmark, SkillSieve reaches 0.800 F1 versus ClawVet's 0.421.
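To make the Layer-3 vote-then-debate protocol concrete, here is a minimal sketch with stub jurors standing in for the three distinct LLM backends; the debate format, round limit, and majority tie-break are assumptions, not the paper's exact procedure.

```python
from collections import Counter
from typing import Callable

# A juror maps a skill dossier (SKILL.md plus Layer-2 findings) to a verdict string.
# Real jurors would wrap three different LLM APIs; stubs keep the sketch self-contained.
Juror = Callable[[str], str]

def jury_verdict(dossier: str, jurors: list[Juror], debate_rounds: int = 2) -> str:
    """Independent votes first; on disagreement, bounded debate, then majority vote."""
    votes = [juror(dossier) for juror in jurors]
    for _ in range(debate_rounds):
        if len(set(votes)) == 1:
            return votes[0]  # unanimous, stop early
        # Each juror re-votes after seeing the others' current positions.
        context = f"{dossier}\nCurrent verdicts from other reviewers: {', '.join(votes)}"
        votes = [juror(context) for juror in jurors]
    return Counter(votes).most_common(1)[0][0]  # majority after debate

# Usage with trivial stand-in jurors:
stubs = [lambda d: "malicious", lambda d: "malicious", lambda d: "benign"]
print(jury_verdict("SKILL.md claims date formatting but requests shell access.", stubs))
# -> "malicious"
```

The early exit on unanimity matters for cost: the expensive multi-model debate only fires when the independent votes actually conflict.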

💡 Why This Paper Matters

Community marketplaces for agent skills are growing faster than anyone can manually vet them: ClawHub alone hosts over 13,000 skills, and recent audits place between 13% and 26% of them in vulnerable territory. This paper shows that the threat spans two modalities, executable code and natural-language instructions, which no single existing tool covers, and that a tiered pipeline can deliver LLM-grade review at commodity cost, averaging $0.006 per skill on hardware as modest as a $40 single-board computer. The open-sourced code, data, and benchmark give the community a concrete, reproducible baseline for agent-skill vetting.

🎯 Why It's Interesting for AI Security Researchers

This paper is directly relevant to researchers studying supply-chain attacks on AI agents. Malicious skills can hide payloads in code, where obfuscation defeats regex scanners, or in SKILL.md instructions, where prompt injection and social engineering defeat static analysis, and SkillSieve's layered design is a concrete study in trading analysis depth against cost. The 49,592-skill corpus, the adversarial samples spanning five evasion techniques, and the 400-skill labeled benchmark offer reusable evaluation material, while the wide margin over ClawVet (0.800 vs. 0.421 F1) suggests substantial headroom remains in this detection problem.

📚 Read the Full Paper