
Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Authors: Jakub Hoscilowicz, Artur Janicki

Published: 2025-11-25

arXiv ID: 2511.20494v3

Added to Library: 2025-12-02 04:01 UTC

Red Teaming

📄 Abstract

We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Practical applications include embedding such adversarial images into websites to prevent MLLM-powered AI Agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and Adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
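The core optimization described in the abstract can be sketched as a PGD loop that maximizes mean next-token entropy over an ensemble of models. The following is a minimal sketch, not the authors' code: it assumes a generic `model(pixel_values=..., input_ids=...) -> logits` interface, and the function name and hyperparameters (`confusion_attack`, `eps`, `alpha`, `steps`) are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def confusion_attack(models, pixel_values, input_ids, eps=8/255, alpha=1/255, steps=100):
    """PGD that maximizes mean next-token entropy over an ensemble of MLLMs (sketch)."""
    x_orig = pixel_values.clone().detach()
    x_adv = x_orig.clone()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        total_entropy = 0.0
        for model in models:
            # Assumed generic interface: returns logits of shape (batch, seq_len, vocab).
            logits = model(pixel_values=x_adv, input_ids=input_ids)
            log_probs = F.log_softmax(logits[:, -1, :], dim=-1)  # next-token distribution
            entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
            total_entropy = total_entropy + entropy
        loss = total_entropy / len(models)

        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Gradient ascent on entropy, then projection into the L-inf ball around the clean image.
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)
            x_adv = x_adv.clamp(0.0, 1.0)
        x_adv = x_adv.detach()
    return x_adv
```

Averaging the entropy loss across the ensemble is what gives the perturbation a chance to transfer beyond the white-box models, which matches the abstract's report of transfer to unseen open-source and proprietary models.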

🔍 Key Points

  • Introduction of the Adversarial Confusion Attack, a new method targeting the instability of multimodal large language models (MLLMs) by maximizing output entropy.
  • Demonstration that a single adversarial image, optimized in the white-box setting, can disrupt every model in the ensemble in both the full-image and Adversarial CAPTCHA settings (see the patch-constrained sketch after this list), increasing confusion and hallucination in model outputs.
  • Characterization of five distinct modes of confusion exhibited by attacked models, illustrating the practical risks such attacks pose.
  • Evidence that the attack transfers to unseen open-source models (e.g., Qwen3-VL) and proprietary models (e.g., GPT-5.1), revealing vulnerabilities with broader implications for AI security.
  • Potential practical application: embedding such adversarial images in websites or interfaces as barriers that prevent MLLM-powered AI agents from operating reliably, complementing existing cybersecurity measures.
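As referenced above, a patch-constrained variant can be sketched by masking the PGD update so the perturbation stays inside a small region of the image. The assumption that the Adversarial CAPTCHA setting corresponds to such a localized patch is ours, based on the abstract's wording; the mask helpers and coordinates below are hypothetical.

```python
import torch

def make_patch_mask(pixel_values, top, left, height, width):
    """Binary mask that is 1 inside the patch region and 0 elsewhere (sketch)."""
    mask = torch.zeros_like(pixel_values)
    mask[..., top:top + height, left:left + width] = 1.0
    return mask

def masked_pgd_step(x_adv, x_orig, grad, mask, alpha, eps):
    """One entropy-ascent PGD step restricted to the masked patch region."""
    x_adv = x_adv + alpha * grad.sign() * mask
    x_adv = x_orig + ((x_adv - x_orig) * mask).clamp(-eps, eps)
    return x_adv.clamp(0.0, 1.0)
```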

💡 Why This Paper Matters

The paper presents a groundbreaking approach to adversarial attacks on MLLMs, showing how strategic manipulation of input images can induce confusion and disrupt model behavior. Its relevance lies not only in expanding the adversarial attack landscape but also in motivating the development of protective measures against such vulnerabilities in AI models.

🎯 Why It's Interesting for AI Security Researchers

This research is crucial for AI security researchers as it highlights a previously underexplored method of attack—confusion attacks—that can destabilize MLLMs widely used in various applications. Understanding these attacks allows researchers to better prepare defenses and improve the reliability of AI systems, thus addressing significant security and ethical concerns in AI deployment.
