Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Authors: Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang

Published: 2026-02-01

arXiv ID: 2602.01025v1

Added to Library: 2026-02-03 08:03 UTC

Red Teaming

📄 Abstract

Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our [GitHub repository](https://github.com/kaiyuanCui/UltraBreak).
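The abstract combines two ingredients: a relaxed, semantic objective computed in the target LLM's textual embedding space, and a regulariser that keeps the adversarial image pattern from overfitting to one surrogate. A minimal toy sketch of how such a combined loss could be structured is below; this is not the authors' implementation (see their repository for that), and the function names, the cosine-similarity semantic term, and the total-variation regulariser are illustrative assumptions:

```python
import math

def cosine_similarity(u, v):
    # Semantic objective: compare embeddings by direction rather than
    # demanding an exact target token sequence.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def total_variation(patch):
    # Vision-space regulariser (illustrative stand-in): penalise
    # high-frequency noise so the pattern stays smooth and less
    # surrogate-specific.
    tv = 0.0
    rows, cols = len(patch), len(patch[0])
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                tv += abs(patch[i + 1][j] - patch[i][j])
            if j + 1 < cols:
                tv += abs(patch[i][j + 1] - patch[i][j])
    return tv

def combined_loss(output_emb, target_emb, patch, tv_weight=0.1):
    # Semantic loss in embedding space (relaxed textual target) plus a
    # smoothness penalty on the image-space adversarial pattern.
    semantic_loss = 1.0 - cosine_similarity(output_emb, target_emb)
    return semantic_loss + tv_weight * total_variation(patch)
```

In the actual method the semantic term is computed in the target LLM's textual embedding space and the pattern is additionally passed through transformations during optimisation; here both are replaced by toy stand-ins to show only the shape of the objective.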

🔍 Key Points

  • Introduction of UltraBreak, a framework for universal and transferable jailbreak attacks against Vision-Language Models (VLMs), built on two key design strategies: constrained optimisation in the vision space and semantically guided textual targets.
  • The research identifies limitations in existing gradient-based methods due to overfitting to surrogate models and proposes a new loss formulation that enhances generalisation and transferability across models and attack targets.
  • Extensive experiments demonstrate UltraBreak's superiority over previous methods, achieving higher Attack Success Rates (ASR) in various settings and providing a detailed analysis of failure modes in earlier approaches.
  • The paper emphasises the need for more robust alignment strategies in VLMs, stressing that integrating the visual modality amplifies potential vulnerabilities.
  • The findings underscore the importance of monitoring and mitigating risks associated with multimodal AI systems, especially as they are deployed in safety-critical applications.

💡 Why This Paper Matters

This paper is crucial as it not only advances the understanding of vulnerabilities within Vision-Language Models but also introduces a robust method for exploiting these weaknesses. As VLMs become increasingly integrated into various applications, the ability to identify and counteract potential security issues is paramount for maintaining trust and safety in AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper valuable because it addresses the intersection of adversarial machine learning and real-world applications of VLMs. Its insights into model vulnerabilities, and its novel approach for bypassing safety mechanisms, contribute to the understanding of AI threats and safety measures. Given the rapid deployment of VLMs in sensitive areas like healthcare and autonomous systems, research on such adversarial attacks is essential for developing effective defense strategies.

📚 Read the Full Paper