
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Authors: Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto

Published: 2026-04-07

arXiv ID: 2604.06436v1

Added to Library: 2026-04-09 02:01 UTC

📄 Abstract

We prove that no continuous, utility-preserving wrapper defense (a function $D: X \to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with a connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\varepsilon$-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend the theory to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.

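As a quick intuition for the first result, here is a minimal sketch of the boundary-fixation argument, under one illustrative reading of utility preservation: $D$ fixes every strictly benign prompt, i.e. $D(x) = x$ whenever $s(x) < \tau$ for a continuous safety score $s$ and threshold $\tau$ (the symbols $s$ and $\tau$ are ours, and the paper's actual hypotheses are more general). If benign prompts $x_n$ converge to a threshold-level prompt $x^\ast$ with $s(x^\ast) = \tau$, continuity of $D$ gives

$$ D(x^\ast) \;=\; \lim_{n\to\infty} D(x_n) \;=\; \lim_{n\to\infty} x_n \;=\; x^\ast, $$

so $s(D(x^\ast)) = s(x^\ast) = \tau$: the threshold-level input is left unchanged and its output is not strictly safe. This is only a sketch under the stated assumptions, not the paper's proof.
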
🔍 Key Points

  • Boundary fixation: any continuous, utility-preserving wrapper defense $D: X \to X$ over a connected prompt space must leave some threshold-level inputs unchanged.
  • $\varepsilon$-robust constraint: under Lipschitz regularity, a positive-measure band around the fixed boundary points remains near-threshold.
  • Persistent unsafe region: under an additional transversality condition, a positive-measure subset of inputs remains strictly unsafe after the defense is applied.
  • Defense trilemma: continuity, utility preservation, and completeness cannot coexist; parallel discrete results require no topology, and the theory extends to multi-turn interactions, stochastic defenses, and capacity-parity settings.
  • The results do not rule out training-time alignment, architectural changes, or defenses that sacrifice utility; the full theory is mechanically verified in Lean 4 (see the statement sketch after this list) and validated empirically on three LLMs.

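Because the abstract notes that the theory is mechanically verified in Lean 4, the following is a hypothetical sketch of how the objects and the trilemma statement might be phrased in Lean 4 with Mathlib. It is not the paper's actual development: the names WrapperDefense, PreservesUtility, and Complete, and the reading of utility preservation as "strictly benign prompts are left unchanged", are assumptions made for illustration, and the theorem is stated with sorry rather than proved.

import Mathlib

-- Hypothetical formalization sketch (not the paper's actual Lean 4 development).
-- Prompt space X is a topological space; s is a safety score with threshold τ.

variable {X : Type*} [TopologicalSpace X]

/-- A wrapper defense: a continuous preprocessor D : X → X. -/
structure WrapperDefense (X : Type*) [TopologicalSpace X] where
  D : X → X
  cont : Continuous D

/-- Utility preservation, read here as: strictly benign prompts are left unchanged. -/
def PreservesUtility (W : WrapperDefense X) (s : X → ℝ) (τ : ℝ) : Prop :=
  ∀ x, s x < τ → W.D x = x

/-- Completeness: every output is strictly safe (strictly below threshold). -/
def Complete (W : WrapperDefense X) (s : X → ℝ) (τ : ℝ) : Prop :=
  ∀ x, s (W.D x) < τ

/-- Trilemma shape: on a connected Hausdorff prompt space whose safety score
    actually crosses the threshold, a continuous, utility-preserving wrapper
    cannot be complete. Statement only; the proof is the content of the paper. -/
theorem no_complete_wrapper [ConnectedSpace X] [T2Space X]
    (W : WrapperDefense X) (s : X → ℝ) (τ : ℝ) (hs : Continuous s)
    (h_benign : ∃ x, s x < τ) (h_unsafe : ∃ x, τ ≤ s x)
    (hU : PreservesUtility W s τ) : ¬ Complete W s τ := by
  sorry
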
💡 Why This Paper Matters

Wrapper defenses, which preprocess inputs before a model sees them, are a common practical response to prompt injection. This paper proves that no such defense can be simultaneously continuous, utility-preserving, and complete on a connected prompt space, and it characterizes exactly where every such defense must fail. This reframes wrapper defenses as risk reduction rather than risk elimination, and it directs attention toward the avenues the impossibility results leave open: training-time alignment, architectural changes, or defenses that accept a loss of utility.

🎯 Why It's Interesting for AI Security Researchers

The paper gives AI security researchers a formal impossibility result for a widely studied defense class: the trilemma between continuity, utility preservation, and completeness bounds what any input-preprocessing defense can achieve, in both continuous and discrete settings and across multi-turn, stochastic, and capacity-parity extensions. Because the theory is mechanically verified in Lean 4 and validated empirically on three LLMs, it also offers a template for pairing formal guarantees with experimental evaluation, and it helps prioritize work on training-time alignment, architectural changes, and defenses that deliberately trade utility for safety.

📚 Read the Full Paper