
Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Authors: Jakub Hoscilowicz, Artur Janicki

Published: 2025-11-25

arXiv ID: 2511.20494v1

Added to Library: 2025-11-26 04:01 UTC

Red Teaming

📄 Abstract

We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
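
The attack's core loop, as the abstract describes it, is projected gradient descent (PGD) that maximizes the average next-token entropy of a small ensemble of MLLMs. The sketch below illustrates that loop against a toy stand-in model rather than a real MLLM; the stand-in architecture, vocabulary size, epsilon, step size, and iteration count are illustrative assumptions, not the authors' configuration.

```python
# Toy sketch of entropy-maximizing PGD: the stand-in model, budget, and
# step schedule are assumptions for illustration, not the paper's setup.
import torch
import torch.nn.functional as F


class ToyMLLM(torch.nn.Module):
    """Stand-in for an MLLM: maps an image tensor to next-token logits."""

    def __init__(self, vocab_size: int = 1000):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, kernel_size=3, stride=4),
            torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Flatten(),
            torch.nn.Linear(8, vocab_size),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image)  # (batch, vocab_size) next-token logits


def next_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the next-token distribution."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()


def confusion_attack(models, image, eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD that ascends the ensemble-average next-token entropy."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        loss = torch.stack([next_token_entropy(m(adv)) for m in models]).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient ascent on entropy
            delta.clamp_(-eps, eps)             # stay inside the L_inf budget
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)


if __name__ == "__main__":
    ensemble = [ToyMLLM().eval() for _ in range(2)]
    clean = torch.rand(1, 3, 224, 224)
    adv = confusion_attack(ensemble, clean)
    print("clean entropy:", next_token_entropy(ensemble[0](clean)).item())
    print("adv entropy:  ", next_token_entropy(ensemble[0](adv)).item())
```

In practice one would replace the stand-in with the next-token logits of real open-source MLLMs for a fixed prompt; the loop itself is unchanged.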

🔍 Key Points

  • Introduction of the Adversarial Confusion Attack, designed specifically to disrupt multimodal large language models (MLLMs) by making them produce incoherent or confidently incorrect outputs.
  • The attack maximizes next-token entropy with PGD over a small ensemble of open-source MLLMs, destabilizing the decoding process and inducing high-confidence hallucinations.
  • Demonstration that a single adversarial image optimized on the white-box ensemble transfers its disruptive effect to unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
  • Characterization of five distinct failure modes resulting from the attack, providing a comprehensive understanding of how MLLMs can fail under adversarial conditions.
  • Discussion of practical applications, such as embedding adversarial images in web pages to hinder MLLM-powered agents (a region-restricted sketch follows this list).
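
The abstract distinguishes a full-image setting from an adversarial CAPTCHA setting, i.e., a small standalone adversarial image embedded in an otherwise clean page. One plausible way to express such a region restriction is a binary mask over the perturbation; the mask formulation, box coordinates, and budget below are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch of a region-restricted ("CAPTCHA-style") perturbation:
# only pixels inside a small box are modified; the rest of the page
# screenshot stays clean. Box size and budget are illustrative.
import torch


def apply_patch(image: torch.Tensor, delta: torch.Tensor, box) -> torch.Tensor:
    """Composite the perturbation into `image` only inside box = (y0, x0, y1, x1)."""
    y0, x0, y1, x1 = box
    mask = torch.zeros_like(image)
    mask[..., y0:y1, x0:x1] = 1.0
    return (image + mask * delta).clamp(0, 1)


if __name__ == "__main__":
    page = torch.rand(1, 3, 224, 224)  # stand-in page screenshot
    delta = torch.empty_like(page).uniform_(-8 / 255, 8 / 255)
    adv_page = apply_patch(page, delta, box=(80, 80, 144, 144))
    # Pixels outside the box are untouched:
    print((adv_page[..., :80, :] - page[..., :80, :]).abs().max().item())
```

If this masked composite replaces `image + delta` inside a PGD loop like the one sketched after the abstract, gradients reach only the masked pixels, so the optimization stays confined to the patch.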

💡 Why This Paper Matters

The paper defines a new class of threat against multimodal large language models: rather than jailbreaking a model or forcing a targeted misclassification, the Adversarial Confusion Attack maximizes output entropy to degrade the model's behavior wholesale. Because a single perturbation crafted on a small open-source ensemble transfers to unseen open-source and proprietary models, the attack has direct implications for the reliability and safety of deployed MLLM systems and makes entropy-based disruption a relevant target for defensive research.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers because the attack targets the model's decoding process itself rather than steering content generation or inducing a specific misclassification. The Adversarial Confusion Attack exposes vulnerabilities that could be exploited maliciously, for instance by embedding adversarial images in web pages to stall or mislead MLLM-powered agents, and it motivates work on corresponding defenses. Its implications for web-based agentic workflows raise practical questions about secure deployment and the robustness of MLLM systems.

📚 Read the Full Paper