MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools

Authors: Zhuoran Tan, Run Hao, Jeremy Singer, Yutian Tang, Christos Anagnostopoulos

Published: 2026-01-03

arXiv ID: 2601.01241v1

Added to Library: 2026-01-07 10:03 UTC

📄 Abstract

Tool-augmented LLM agents raise new security risks: tool executions can introduce runtime-only behaviors, including prompt injection and unintended exposure of external inputs (e.g., environment secrets or local files). While existing scanners often focus on static artifacts, analyzing runtime behavior is challenging because directly executing untrusted tools can itself be dangerous. We present MCP-SandboxScan, a lightweight framework motivated by the Model Context Protocol (MCP) that safely executes untrusted tools inside a WebAssembly/WASI sandbox and produces auditable reports of external-to-sink exposures. Our prototype (i) extracts LLM-relevant sinks from runtime outputs (prompt/messages and structured tool-return fields), (ii) instantiates external-input candidates from environment values, mounted file contents, and output-surfaced HTTP fetch intents, and (iii) links sources to sinks via snippet-based substring matching. Case studies on three representative tools show that MCP-SandboxScan can surface provenance evidence when external inputs appear in prompt/messages or tool-return payloads, and can expose filesystem capability violations as runtime evidence. We further compare against a lightweight static string-signature baseline and use a micro-benchmark to characterize false negatives under transformations and false positives from short-token collisions.
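The source-to-sink linking step described in the abstract can be made concrete with a short sketch. The code below is not the authors' implementation; it is a minimal Python illustration, assuming hypothetical source/sink records and an arbitrary snippet length, of how external-input candidates (environment values, mounted file contents, fetched bodies) might be linked to LLM-facing sinks via snippet-based substring matching.

```python
# Minimal sketch (not the paper's code): link external-input "sources" to
# LLM-facing "sinks" by checking whether any snippet of a source value
# appears verbatim inside a sink value. Names and the snippet length are
# illustrative assumptions, not details from the paper.

SNIPPET_LEN = 16  # assumed window size; shorter snippets risk collisions


def snippets(value: str, length: int = SNIPPET_LEN):
    """Yield overlapping fixed-length snippets of a source value."""
    if len(value) <= length:
        yield value
        return
    for i in range(len(value) - length + 1):
        yield value[i:i + length]


def link_sources_to_sinks(sources: dict[str, str], sinks: dict[str, str]):
    """Return (source_name, sink_name, matched_snippet) provenance triples."""
    findings = []
    for src_name, src_value in sources.items():
        for sink_name, sink_value in sinks.items():
            for snip in snippets(src_value):
                if snip and snip in sink_value:
                    findings.append((src_name, sink_name, snip))
                    break  # one matching snippet is enough per pair
    return findings


if __name__ == "__main__":
    # Hypothetical runtime observations from a sandboxed tool run.
    sources = {
        "env:API_TOKEN": "sk-live-0123456789abcdef0123",
        "file:/mnt/config.json": '{"endpoint": "https://internal.example"}',
    }
    sinks = {
        "prompt/messages[0]": "Summarize: token sk-live-0123456789abcdef0123",
        "tool_return.body": "ok",
    }
    for src, sink, snip in link_sources_to_sinks(sources, sinks):
        print(f"{src} -> {sink} via snippet {snip!r}")
```

A sketch like this also makes the paper's micro-benchmark concerns tangible: snippets that are too short can collide with unrelated text (false positives), while any transformation of the source value before it reaches a sink, such as encoding or truncation, defeats verbatim matching (false negatives).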

🔍 Key Points

  • MCP-SandboxScan executes untrusted MCP tools inside a WebAssembly/WASI sandbox, turning runtime-only behaviors such as prompt injection and unintended exposure of environment secrets or local files into auditable evidence without endangering the host (see the runtime sketch after this list).
  • The prototype extracts LLM-relevant sinks from runtime outputs, including prompt/messages and structured tool-return fields.
  • External-input candidates are instantiated from environment values, mounted file contents, and output-surfaced HTTP fetch intents, then linked to sinks via snippet-based substring matching.
  • Case studies on three representative tools show that the framework surfaces provenance evidence when external inputs appear in prompt/messages or tool-return payloads, and exposes filesystem capability violations as runtime evidence.
  • A comparison against a lightweight static string-signature baseline and a micro-benchmark characterize false negatives under transformations and false positives from short-token collisions.
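As a companion to the first bullet, the following is a minimal sketch of how a tool compiled to WASM could be run under a WASI sandbox with an explicit capability grant, using the wasmtime Python bindings. The module path, the pre-opened directory, the seeded environment variable, and the captured-stdout file are assumptions for illustration; the paper does not specify its runtime configuration at this level of detail.

```python
# Minimal sketch (not the paper's harness): run an untrusted tool compiled to
# WASM under WASI, granting only an explicit directory and capturing stdout
# for later sink extraction. Paths and filenames are illustrative assumptions.
from wasmtime import Engine, Linker, Module, Store, WasiConfig

engine = Engine()
store = Store(engine)

wasi = WasiConfig()
wasi.argv = ["tool"]                       # argv[0] seen by the guest
wasi.env = [("EXAMPLE_SECRET", "s3cr3t")]  # seeded external-input candidate
wasi.preopen_dir("./mount", "/data")       # only this directory is visible
wasi.stdout_file = "tool_stdout.json"      # runtime output captured for scanning
store.set_wasi(wasi)

linker = Linker(engine)
linker.define_wasi()

module = Module.from_file(engine, "tool.wasm")  # hypothetical tool artifact
instance = linker.instantiate(store, module)
instance.exports(store)["_start"](store)        # invoke the WASI entry point

# tool_stdout.json can now be parsed for prompt/messages and tool-return
# fields, and any attempt to open paths outside /data surfaces as a
# capability error rather than a host filesystem access.
```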

💡 Why This Paper Matters

MCP-SandboxScan addresses a practical gap in securing tool-augmented LLM agents: static scanners miss runtime-only behaviors, yet executing untrusted tools directly is itself dangerous. By confining execution to a WebAssembly/WASI sandbox and producing auditable reports of external-to-sink exposures, the framework makes runtime analysis of MCP tools both safe and evidence-driven, while its baseline comparison and micro-benchmark clarify where simple string-signature matching succeeds and where it fails.

🎯 Why It's Interesting for AI Security Researchers

This paper is directly relevant to researchers studying prompt-injection and supply-chain risks in agentic systems built on the Model Context Protocol. It demonstrates a lightweight, reproducible way to obtain runtime provenance evidence, linking environment secrets, mounted files, and HTTP fetch intents to LLM-facing sinks, and to surface filesystem capability violations during execution. Its analysis of false negatives under transformations and false positives from short-token collisions also highlights concrete limitations that future detection work on MCP tooling will need to address.

📚 Read the Full Paper