
Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Authors: Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu

Published: 2026-04-08

arXiv ID: 2604.06950v1

Added to Library: 2026-04-09 02:01 UTC

Red Teaming

📄 Abstract

Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we construct SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to counter this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.
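
To make the Perceptual Blindness pathway concrete, the sketch below renders a message with per-glyph jitter, rotation, and low contrast so that a human can still read it while OCR and vision encoders are more likely to fail. This is not the paper's attack code: the function render_smuggled_text and its parameters are hypothetical, illustrative stand-ins built on Pillow.

```python
# Hypothetical sketch (not the paper's attack code) of the "Perceptual Blindness"
# idea: render text so humans can still read it, while per-glyph jitter, rotation,
# and low contrast degrade OCR / vision encoders. Parameters are illustrative.
import random
from PIL import Image, ImageDraw, ImageFont

def render_smuggled_text(message: str,
                         glyph_size: int = 36,
                         jitter: int = 6,
                         max_angle: float = 20.0) -> Image.Image:
    """Render `message` with per-character perturbations (illustrative only)."""
    font = ImageFont.load_default()  # stand-in; a real TTF font would look better
    width = glyph_size * (len(message) + 3)
    canvas = Image.new("L", (width, glyph_size * 3), color=230)  # light grey page

    x = glyph_size
    for ch in message:
        # Draw each glyph on its own tile so it can be rotated independently.
        tile = Image.new("L", (glyph_size, glyph_size), color=230)
        ImageDraw.Draw(tile).text((4, 4), ch, fill=200, font=font)  # low contrast
        tile = tile.rotate(random.uniform(-max_angle, max_angle),
                           expand=True, fillcolor=230)
        y = glyph_size + random.randint(-jitter, jitter)   # wavy baseline
        canvas.paste(tile, (x, y))
        x += glyph_size - random.randint(0, jitter)        # slightly overlapping spacing
    return canvas

if __name__ == "__main__":
    render_smuggled_text("example text").save("smuggled.png")
```

In practice the paper's attacks span many such visual encodings; this example only shows why simple geometric and contrast perturbations can open a gap between human readability and machine perception.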

🔍 Key Points

  • Identification of Adversarial Smuggling Attacks (ASA) as a novel threat to Multimodal Large Language Models (MLLMs), classifying them into two main pathways: Perceptual Blindness and Reasoning Blockade.
  • Construction of SmuggleBench, a comprehensive benchmark of 1,700 adversarial smuggling instances designed to evaluate MLLM vulnerability to ASA.
  • Extensive evaluation of state-of-the-art MLLMs (e.g., GPT-5, Qwen3-VL) revealing alarmingly high Attack Success Rates (ASR) exceeding 90%, indicating systemic vulnerabilities in visual perception and reasoning (a sketch of the ASR computation follows this list).
  • Analysis of the root causes of these vulnerabilities: inadequacies in vision encoders, robustness gaps in Optical Character Recognition (OCR), and the scarcity of domain-specific adversarial examples in training data.
  • Initial exploration of mitigation strategies, including Chain-of-Thought prompting and supervised fine-tuning, which proved partially effective but highlighted the need for more robust long-term solutions.
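
The ASR figures above are, in essence, the fraction of harmful benchmark images that a moderator fails to flag. The sketch below illustrates that computation over SmuggleBench-style instances; the attack_success_rate helper, the instance schema, and the moderate_image callable are assumptions for illustration, not the benchmark's actual interface.

```python
# Hypothetical sketch of computing an Attack Success Rate (ASR) over
# SmuggleBench-style instances; `moderate_image` stands in for any MLLM-based
# moderator (the real benchmark's interface and prompts may differ).
from typing import Callable, Iterable

def attack_success_rate(instances: Iterable[dict],
                        moderate_image: Callable[[str], bool]) -> float:
    """ASR = fraction of harmful images the moderator fails to flag.

    Each instance is assumed to look like {"image_path": str, "is_harmful": True}.
    `moderate_image(path)` should return True when the moderator flags the image.
    """
    total, evaded = 0, 0
    for inst in instances:
        total += 1
        if not moderate_image(inst["image_path"]):  # harmful content slipped through
            evaded += 1
    return evaded / total if total else 0.0
```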

💡 Why This Paper Matters

This paper highlights a critical and under-explored vulnerability in the rapidly evolving field of Multimodal Large Language Models (MLLMs), focusing on adversarial smuggling and its two attack pathways, Perceptual Blindness and Reasoning Blockade. Its findings urge reconsideration of current moderation approaches and emphasize the need for systems that remain resilient to attacks exploiting the human-AI capability gap. The work lays the groundwork for future research on content moderation techniques that can withstand such adversarial tactics.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers for its in-depth examination of a novel adversarial attack vector targeting MLLMs, which are increasingly used as automated content moderators. It provides concrete insight into how current systems can be circumvented by malicious actors, a prerequisite for designing effective defenses. Furthermore, SmuggleBench offers a practical testbed for evaluating MLLM robustness against such attacks and for exploring improved moderation techniques.
