← Back to Library

Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models

Authors: Youze Wang, Wenbo Hu, Yinpeng Dong, Jing Liu, Hanwang Zhang, Richang Hong

Published: 2025-06-02

arXiv ID: 2506.01307v1

Added to Library: 2025-06-04 04:03 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have evolved into Multimodal Large Language Models (MLLMs), significantly enhancing their capabilities by integrating visual information and other types, thus aligning more closely with the nature of human intelligence, which processes a variety of data forms beyond just text. Despite advancements, the undesirable generation of these models remains a critical concern, particularly due to vulnerabilities exposed by text-based jailbreak attacks, which have represented a significant threat by challenging existing safety protocols. Motivated by the unique security risks posed by the integration of new and old modalities for MLLMs, we propose a unified multimodal universal jailbreak attack framework that leverages iterative image-text interactions and transfer-based strategy to generate a universal adversarial suffix and image. Our work not only highlights the interaction of image-text modalities can be used as a critical vulnerability but also validates that multimodal universal jailbreak attacks can bring higher-quality undesirable generations across different MLLMs. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP, and reveal significant multimodal safety alignment issues, highlighting the inadequacy of current safety mechanisms against sophisticated multimodal attacks. This study underscores the urgent need for robust safety measures in MLLMs, advocating for a comprehensive review and enhancement of security protocols to mitigate potential risks associated with multimodal capabilities.

🔍 Key Points

  • Proposes a novel multimodal universal jailbreak attack framework leveraging iterative image-text interactions to exploit safety vulnerabilities in Multimodal Large Language Models (MLLMs).
  • Demonstrates that combining adversarial images and suffixes outperforms single modality approaches, revealing critical security risks associated with image-text interactions in MLLMs.
  • Validates the effectiveness of the proposed method against various MLLMs (LLaVA, Yi-VL, MiniGPT4, etc.) using different evaluation metrics (ASR, ASR-G).
  • Highlights significant gaps in current safety mechanisms for MLLMs and advocates for improved security protocols to mitigate risks from sophisticated multimodal attacks.
  • Provides insights into the necessity for robust cross-modal defenses in MLLMs and the need for ongoing research into the adversarial robustness of AI systems.

💡 Why This Paper Matters

This paper addresses a critical security threat in the rapidly evolving landscape of MLLMs by innovating a multimodal jailbreak attack methodology that highlights the vulnerabilities introduced by multimodal capabilities. Its findings urge researchers and practitioners to reassess the safety protocols applied to MLLMs, ensuring robust defenses against not only traditional attacks but also those exploiting interactions between text and images. This research is pivotal as it lays the groundwork for future advancements in AI safety measures and promotes a more secure utilization of AI technologies in sensitive applications.

🎯 Why It's Interesting for AI Security Researchers

The relevance of this paper to AI security researchers is significant, as it uncovers and analyzes the vulnerabilities posed by multimodal integrations in large language models. Given the increasing adoption of MLLMs in sensitive domains, understanding these types of adversarial attacks can influence how security protocols are designed and implemented, paving the way for more resilient AI systems in the face of sophisticated threats. Additionally, the methodologies and insights presented could inspire further explorations into effective defenses against multimodal adversarial threats.

📚 Read the Full Paper