
JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

Authors: Md Jueal Mia, M. Hadi Amini

Published: 2025-09-24

arXiv ID: 2509.21401v1

Added to Library: 2025-09-29 04:02 UTC

Red Teaming

📄 Abstract

Vision-Language Models (VLMs) have remarkable abilities in multimodal reasoning tasks. However, safety-alignment concerns and the potential for misuse of VLMs have grown significantly across different categories of attack vectors. Among these, recent studies have demonstrated that image-based perturbations are particularly effective at eliciting harmful outputs. Many techniques have been proposed in the literature to jailbreak VLMs, but they often suffer from unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between the clean and adversarial images with the model's harmful-output loss. We evaluate the proposed method on VLMs using standard toxicity metrics from the Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in producing toxic outputs. Moreover, we evaluate our method in the transportation domain to demonstrate the attack's practicality beyond toxic text generation in a specific domain. Our findings emphasize the practical challenges posed by image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.
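
Based on the description in the abstract, the joint objective plausibly takes the following form; the notation, the trade-off weight λ, and the concrete harmful-output loss (here, negative log-likelihood over a set of target strings Y) are assumptions for illustration, not taken from the paper:

```latex
% Sketch of JaiLIP's joint objective as described in the abstract (notation assumed):
%   x: clean image,  \delta: additive perturbation,  x + \delta: adversarial image
%   Y: set of harmful target outputs,  p_\theta: the VLM's output distribution
%   \lambda: trade-off weight between imperceptibility and attack strength
\[
\min_{\delta}\;
  \lambda \,\underbrace{\lVert (x + \delta) - x \rVert_2^2}_{\text{MSE (imperceptibility)}}
  \;+\;
  \underbrace{\Bigl(-\sum_{y \in Y} \log p_{\theta}\bigl(y \mid x + \delta\bigr)\Bigr)}_{\text{harmful-output loss}}
\]
```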

🔍 Key Points

  • Proposed the JaiLIP method, which employs loss-guided image perturbation to effectively jailbreak vision-language models while maintaining imperceptibility.
  • Showed that JaiLIP outperforms existing PGD-based methods in generating toxic text outputs, yielding higher toxicity scores while keeping the perturbations visually minimal.
  • Conducted extensive experimental evaluations on MiniGPT-4 and BLIP-2, demonstrating the method's effectiveness across different domains, including a transportation use case.
  • Introduced an optimization strategy that combines the mean squared error (MSE) loss and the model's harmful-output loss, allowing a better balance between attack effectiveness and visual similarity (a minimal sketch of this optimization appears after this list).
  • Highlighted the urgent need for robust defense mechanisms against image-based jailbreak attacks in multimodal systems.
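
As a concrete illustration of the loss-guided optimization in the fourth point above, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the `vlm_loss` hook, the hyperparameters, and the plain Adam update are assumptions, and the paper may use a different optimizer, weighting, or constraint set.

```python
# Minimal sketch of loss-guided image perturbation (not the authors' code).
# `vlm_loss(image, targets)` is a hypothetical hook that returns the model's
# language-modeling loss for a batch of harmful target strings given `image`
# (e.g., cross-entropy over target tokens in a MiniGPT-4 / BLIP-2 style model).
import torch


def jailip_perturb(x_clean: torch.Tensor,
                   vlm_loss,                 # callable: (image, targets) -> scalar loss
                   harmful_targets,          # list[str] of harmful target outputs
                   steps: int = 500,
                   lr: float = 1e-2,
                   mse_weight: float = 1.0) -> torch.Tensor:
    """Optimize an additive perturbation that keeps the adversarial image close
    to the clean image (MSE term) while lowering the model's loss on harmful
    target text (harmful-output term)."""
    delta = torch.zeros_like(x_clean, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        x_adv = (x_clean + delta).clamp(0.0, 1.0)   # keep a valid image in [0, 1]
        mse = torch.mean((x_adv - x_clean) ** 2)    # imperceptibility term
        harm = vlm_loss(x_adv, harmful_targets)     # harmful-output term
        loss = mse_weight * mse + harm              # joint objective

        opt.zero_grad()
        loss.backward()
        opt.step()

    return (x_clean + delta).detach().clamp(0.0, 1.0)
```

In this sketch, the MSE term pulls the adversarial image back toward the clean one, while the harmful-output term drives the VLM toward completing the harmful target strings; `mse_weight` controls the trade-off between imperceptibility and attack strength.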

💡 Why This Paper Matters

The JaiLIP paper presents a significant advancement in the understanding and exploitation of vulnerabilities in vision-language models. By providing a novel method for creating imperceptible adversarial images that cause harmful outputs, it sheds light on the potential risks associated with multimodal AI systems. This work is essential for both advancing AI robustness and addressing the ethical implications of such technologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers as it addresses emerging vulnerabilities in complex multimodal systems, particularly concerning the safety and alignment of vision-language models. Its exploration of new attack vectors and emphasis on the need for effective defenses provide a foundation for future research aimed at improving the resilience of AI systems against adversarial threats.

📚 Read the Full Paper