The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail

Authors: Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Joel Webb, Blake Gatto

Published: 2026-04-07

arXiv ID: 2604.06436v2

Added to Library: 2026-04-10 02:03 UTC

📄 Abstract

We prove that no continuous, utility-preserving wrapper defense (a function $D: X \to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with a connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\varepsilon$-robustness constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend the analysis to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
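The obstruction behind the weakest result, boundary fixation, is an intermediate-value argument: a continuous defended score on a connected prompt space that takes values on both sides of the safety threshold must hit the threshold exactly. Below is a minimal Lean 4 sketch of that step using Mathlib; the setup (a real-valued safety score `f`, threshold `τ`, and wrapper `D`) is our illustrative reading of the abstract, not the paper's actual formalization, which is not reproduced here.

```lean
import Mathlib

-- Illustrative sketch (not the paper's development): the prompt space X is a
-- (pre)connected topological space, f is a real-valued safety score, τ a
-- threshold, and D a wrapper defense. If the defended score f ∘ D is
-- continuous and takes values on both sides of τ, then some input's
-- defended score equals τ exactly.
variable {X : Type*} [TopologicalSpace X] [PreconnectedSpace X]

theorem boundary_fixation (f : X → ℝ) (τ : ℝ) (D : X → X)
    (hcont : Continuous (f ∘ D))
    (x₀ x₁ : X) (h₀ : f (D x₀) ≤ τ) (h₁ : τ ≤ f (D x₁)) :
    ∃ x : X, f (D x) = τ := by
  -- Intermediate value theorem on a preconnected space:
  -- Set.Icc ((f ∘ D) x₀) ((f ∘ D) x₁) ⊆ Set.range (f ∘ D).
  obtain ⟨x, hx⟩ := intermediate_value_univ x₀ x₁ hcont ⟨h₀, h₁⟩
  exact ⟨x, hx⟩
```

The two stronger results described in the abstract add Lipschitz and transversality hypotheses on top of this skeleton to upgrade a single threshold point into a positive-measure band and a persistently unsafe region.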

🔍 Key Points

  • An impossibility theorem: no continuous, utility-preserving wrapper defense $D: X \to X$ can make all outputs strictly safe for a model with a connected prompt space, together with a characterization of exactly where every such defense must fail.
  • Three results under successively stronger hypotheses: boundary fixation (some threshold-level inputs must be left unchanged), an $\varepsilon$-robust band of near-threshold inputs under Lipschitz regularity, and a positive-measure persistently unsafe region under a transversality condition (see the numerical sketch after this list).
  • A defense trilemma: continuity, utility preservation, and completeness cannot coexist in a wrapper defense.
  • Parallel discrete results requiring no topology, with extensions to multi-turn interactions, stochastic defenses, and capacity-parity settings.
  • Mechanical verification of the full theory in Lean 4 and empirical validation on three LLMs; training-time alignment, architectural changes, and utility-sacrificing defenses are explicitly not precluded.
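As a complement to the formal statement, the following self-contained numerical sketch illustrates the trilemma on a toy one-dimensional prompt space. The threshold, score, and defense below are our own illustrative choices, not the paper's experimental setup: the wrapper fixes strictly safe inputs (utility preservation) and is continuous, so it can pull unsafe inputs down to the threshold but never strictly below it.

```python
# Toy illustration of the defense trilemma (assumed setup, not the paper's code).
# Prompt space X = [0, 1]; f is a continuous safety score; inputs with
# f(x) >= TAU count as unsafe; D is a continuous, utility-preserving wrapper.
import numpy as np

TAU = 0.5   # safety threshold
EPS = 0.05  # half-width of the near-threshold band

def f(x):
    """Continuous toy safety score; crosses the threshold at x = TAU."""
    return x

def D(x):
    """Continuous, 1-Lipschitz wrapper that fixes strictly safe inputs
    (utility preservation) and clips unsafe ones to the boundary."""
    return np.minimum(x, TAU)

xs = np.linspace(0.0, 1.0, 10_001)
defended = f(D(xs))

# Boundary fixation: the defended score reaches the threshold exactly,
# so not every output is *strictly* safe (completeness fails).
print("max defended score:", defended.max())  # 0.5 == TAU

# ε-robust band: a positive-measure set of inputs stays within EPS of TAU.
near_threshold = np.mean(np.abs(defended - TAU) <= EPS)
print("fraction of inputs within EPS of the threshold:", near_threshold)
```

Pushing the defended score strictly below TAU for every input would force either a discontinuity at the safe/unsafe boundary or a modification of strictly safe inputs; giving up continuity, utility preservation, or completeness is exactly the choice the trilemma says is unavoidable.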

💡 Why This Paper Matters

This paper is significant because it gives a formal impossibility result for a widely deployed class of defenses: input-preprocessing wrappers placed in front of a language model. By proving that continuity, utility preservation, and completeness cannot coexist, and by characterizing exactly where every such wrapper must fail, it shifts the question from whether a given wrapper can be patched toward where defensive effort should go instead: training-time alignment, architectural changes, or defenses that deliberately trade away utility.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of particular interest because it replaces case-by-case empirical evaluation of wrapper defenses with a mechanically verified theorem about their limits, including discrete, multi-turn, stochastic, and capacity-parity variants that map onto realistic deployment settings. The characterization of the persistent unsafe region tells defenders precisely which near-threshold inputs to watch, and the Lean 4 formalization offers a template for machine-checked security arguments.

📚 Read the Full Paper