Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

Authors: Xinkai Wang, Beibei Li, Zerui Shao, Ao Liu, Shouling Ji

Published: 2025-10-20

arXiv ID: 2510.17277v1

Added to Library: 2025-10-21 04:02 UTC

Red Teaming

📄 Abstract

Multimodal large language models (MLLMs) have demonstrated significant utility across diverse real-world applications. However, MLLMs remain vulnerable to jailbreaks, in which adversarial inputs collapse their safety constraints and trigger unethical responses. In this work, we investigate jailbreaks in the text-vision multimodal setting and are the first to observe that visual alignment imposes uneven safety constraints across modalities in MLLMs, giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning. First, we probe the model's attention dynamics and latent representation space, assessing how visual inputs reshape cross-modal information flow and diminish the model's ability to separate harmful from benign inputs, thereby exposing exploitable vulnerabilities. On this basis, we systematize these vulnerabilities into generalizable, reusable operational rules that constitute a structured library of Atomic Strategy Primitives, which translate harmful intents into jailbreak inputs through step-wise transformations. Guided by these primitives, PolyJailbreak employs a multi-agent optimization process that automatically adapts inputs against the target models. We conduct comprehensive evaluations on a variety of open-source and closed-source MLLMs, demonstrating that PolyJailbreak outperforms state-of-the-art baselines.

🔍 Key Points

  • Identification of Multimodal Safety Asymmetry: The paper highlights how visual alignment strategies weaken the safety mechanisms of Multimodal Large Language Models (MLLMs), making them more susceptible to jailbreak attacks.
  • Development of PolyJailbreak: A novel black-box method that uses reinforcement learning to automate the generation of adversarial inputs exploiting the identified vulnerabilities in MLLMs.
  • Creation of Atomic Strategy Primitives Library: The authors introduce a structured library of reusable operational rules that simplify the crafting of jailbreak prompts across various attacks.
  • Comprehensive Evaluation Results: PolyJailbreak significantly outperforms existing baselines, achieving high attack success rates across a range of MLLMs and demonstrating adaptability and robustness in different contexts.
  • Systematic Investigation of Vulnerabilities: The research provides a detailed analysis of how different alignment schemes and visual inputs interact to affect MLLM safety, contributing to a deeper understanding of attack vectors.

💡 Why This Paper Matters

This paper presents critical insights into the vulnerabilities of multimodal large language models, specifically revealing how visual inputs compromise their safety mechanisms. By developing PolyJailbreak, which effectively generates adversarial inputs, the authors make a substantial contribution to the field of AI security. Their findings underscore the urgency with which researchers and developers must address these vulnerabilities to safeguard against potential misuse of MLLMs in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is essential as it not only uncovers fundamental vulnerabilities in widely used MLLMs but also provides a framework for understanding and exploiting these weaknesses. The systematic approach to analyzing multimodal interactions, combined with robust attack methodologies, lays the groundwork for future security measures and defenses against adversarial attacks in AI systems. The implications extend to enhancing safety protocols, informing design choices in model architectures, and promoting better alignment in AI technologies.

📚 Read the Full Paper