Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

Authors: Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu, Shagufta Mehnaz

Published: 2026-02-08

arXiv ID: 2602.08136v1

Added to Library: 2026-02-10 05:00 UTC

Red Teaming

πŸ“„ Abstract

Vision-Language Models (VLMs) are now a core part of modern AI. Recent work proposed several visual jailbreak attacks using single, holistic images. However, contemporary VLMs demonstrate strong robustness against such attacks due to extensive safety alignment through preference optimization (e.g., RLHF). In this work, we identify a new vulnerability: while VLM pretraining and instruction tuning generalize well to split-image inputs, safety alignment is typically performed only on holistic images and does not account for harmful semantics distributed across multiple image fragments. Consequently, VLMs often fail to detect and refuse harmful split-image inputs, where unsafe cues emerge only after combining images. We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment. Unlike prior optimization-based attacks, which exhibit poor black-box transferability due to architectural and prior mismatches across models, our attacks evolve in progressive phases from naive splitting to an adaptive white-box attack, culminating in a black-box transfer attack. Our strongest strategy leverages a novel adversarial knowledge distillation (Adv-KD) algorithm to substantially improve cross-model transferability. Evaluations on three state-of-the-art modern VLMs and three jailbreak datasets demonstrate that our strongest attack achieves up to 60% higher transfer success than existing baselines. Lastly, we propose efficient ways to address this critical vulnerability in current VLM safety alignment.
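The core idea behind the naive splitting phase can be illustrated with a toy sketch. This is an assumption-laden illustration, not the paper's implementation: the actual SIVA attacks operate on real images fed to VLMs, while here an "image" is just a 2D grid of pixel values that is cut into quadrant fragments and losslessly reassembled, mirroring the premise that each fragment alone carries no unsafe cue while the combined whole does.

```python
# Toy sketch of the split-image premise: cut a 2D pixel grid into
# fragments, then recombine them. In the attack setting, each fragment
# would be individually benign; the harmful semantics only emerge after
# reassembly, which holistic-image safety alignment does not cover.

def split_quadrants(image):
    """Split a 2D pixel grid into four quadrant fragments."""
    h, w = len(image), len(image[0])
    mh, mw = h // 2, w // 2
    return [
        [row[:mw] for row in image[:mh]],   # top-left
        [row[mw:] for row in image[:mh]],   # top-right
        [row[:mw] for row in image[mh:]],   # bottom-left
        [row[mw:] for row in image[mh:]],   # bottom-right
    ]

def reassemble(fragments):
    """Invert split_quadrants: stitch four quadrants back into one grid."""
    tl, tr, bl, br = fragments
    top = [left + right for left, right in zip(tl, tr)]
    bottom = [left + right for left, right in zip(bl, br)]
    return top + bottom

if __name__ == "__main__":
    img = [[r * 4 + c for c in range(4)] for r in range(4)]
    parts = split_quadrants(img)
    assert reassemble(parts) == img  # lossless round trip
    print(len(parts))  # 4 fragments
```

The adaptive and transfer phases described in the abstract go well beyond this: they optimize the fragments themselves (white-box) and distill adversarial knowledge across architectures (Adv-KD), which no simple geometric split captures.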

πŸ” Key Points

  • Identification of a novel vulnerability in Vision-Language Models (VLMs) regarding split-image harmful input attacks, demonstrating their susceptibility despite robustness to traditional single-image attacks.
  • Introduction of the Split-Image Visual Jailbreak Attacks (SIVA), which consist of three progressive phases: Naïve SIVA, Adaptive SIVA, and Transfer SIVA, enhancing the understanding of adversarial manipulation in VLMs.
  • Development of an efficient Adv-KD (Adversarial Knowledge Distillation) algorithm that significantly improves the transferability of adversarial attacks across different model architectures, highlighting weaknesses in current VLM defenses.
  • Proposal of a novel augmented Direct Preference Optimization (aDPO) strategy, which addresses safety-alignment vulnerabilities in VLM training with minimal human intervention and resources.
  • Comprehensive evaluation of proposed methods against state-of-the-art models and datasets, showcasing a significant increase in attack success rates and underscoring the need for robust defenses.

πŸ’‘ Why This Paper Matters

This paper is important as it sheds light on new attack vectors against Vision-Language Models (VLMs), particularly through the exploitation of split-image vulnerabilities. The findings not only reveal critical weaknesses in existing VLM safety mechanisms but also contribute to the ongoing discourse on AI safety and security by proposing actionable methods to reinforce model resilience. With the increasing reliance on VLMs in various applications, understanding and mitigating such vulnerabilities is crucial for ensuring safe deployment in real-world scenarios.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper of significant interest as it highlights an emerging class of vulnerabilities that could be exploited in VLMs. The innovative methods proposed, such as the SIVA framework and Adv-KD algorithm, provide new insights into adversarial behavior in AI systems. Additionally, research into enhancing VLMs' robustness through improved alignment strategies contributes to the broader field of adversarial machine learning, making this work relevant for developing secure AI technologies.

πŸ“š Read the Full Paper