← Back to Library

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Authors: Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu

Published: 2026-04-08

arXiv ID: 2604.06950v2

Added to Library: 2026-04-10 02:03 UTC

Red Teaming

📄 Abstract

Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

🔍 Key Points

  • Introduction of Adversarial Smuggling Attacks (ASA) as a new class of threat in MLLM moderation, exploiting the gap between human and AI perception.
  • Development of SmuggleBench, a comprehensive benchmark comprising 1,700 adversarial instances to evaluate the vulnerability of MLLMs to ASA.
  • Characterization of two attack pathways for ASA: 'Perceptual Blindness' and 'Reasoning Blockade', illustrating distinct failures in text recognition and semantic understanding.
  • Empirical evidence showing that both proprietary and open-source state-of-the-art models exhibit high Attack Success Rates (ASR) above 90%.
  • Preliminary exploration of mitigation strategies such as Chain-of-Thought prompting (CoT) and Supervised Fine-Tuning (SFT), highlighting the challenges in achieving robust defenses.

💡 Why This Paper Matters

This paper is significant as it highlights a critical vulnerability in the rapidly evolving field of automated content moderation using MLLMs. The introduction of Adversarial Smuggling Attacks poses real-world risks, enabling the circulation of harmful content while evading detection. By systematically evaluating the threats via SmuggleBench, it underscores the urgent need for improved model robustness and suggestions for future defenses.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of particular interest as it identifies specific limitations in current MLLM architectures, elucidates new attack methodologies, and assesses their effectiveness. The research raises crucial questions about the adequacy of existing content moderation strategies and encourages further investigation into adaptive defenses against emerging adversarial tactics.

📚 Read the Full Paper