Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Authors: Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen

Published: 2026-03-18

arXiv ID: 2603.17372v1

Added to Library: 2026-03-19 02:01 UTC

Red Teaming

📄 Abstract

Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
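To make the quantification concrete, here is a minimal PyTorch sketch of one plausible way to estimate the jailbreak direction and the jailbreak-related shift from hidden states. The mean-difference estimator, the tensor shapes, and the function names are illustrative assumptions; this summary does not specify the paper's exact estimator or layer choice.

```python
import torch


def jailbreak_direction(h_jailbreak: torch.Tensor, h_refusal: torch.Tensor) -> torch.Tensor:
    """Estimate a unit 'jailbreak direction' as a difference of class means.

    h_jailbreak: (N_j, d) hidden states of harmful prompts the model complied with.
    h_refusal:   (N_r, d) hidden states of harmful prompts the model refused.
    Assumption: a mean-difference estimator at a single layer; the paper may
    use a different estimator or aggregate over layers.
    """
    d = h_jailbreak.mean(dim=0) - h_refusal.mean(dim=0)
    return d / d.norm()


def jailbreak_related_shift(h_text: torch.Tensor,
                            h_multimodal: torch.Tensor,
                            direction: torch.Tensor) -> torch.Tensor:
    """Component of the image-induced representation shift along the direction.

    h_text:       (d,) hidden state for the text-only prompt.
    h_multimodal: (d,) hidden state for the same prompt with the image attached.
    direction:    (d,) unit-norm jailbreak direction.
    """
    shift = h_multimodal - h_text            # image-induced representation shift
    return (shift @ direction) * direction   # projection onto the jailbreak direction
```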

🔍 Key Points

  • The paper shows that jailbreak samples in VLMs (vision-language models) occupy a distinct state in representation space, separable from both benign and refusal samples, indicating that the models recognize harmful intent but fail to trigger refusal in certain contexts.
  • The authors hypothesize that adding an image induces a 'jailbreak-related shift' in the representation space, which drives the increased jailbreak success rates. This shift is quantified and linked directly to diverse phenomena observed in jailbreak attacks.
  • They introduce JRS-Rem, a defense that removes the jailbreak-related shift from the model's representations at inference time, improving safety without significantly degrading performance on benign tasks (a minimal sketch follows this list).
  • Empirical results across multiple VLM architectures (LLaVA-1.5-7B, ShareGPT4V-7B, and InternVL-Chat-19B) show that JRS-Rem substantially reduces attack success rates (ASR) across diverse scenarios, including explicitly harmful and adversarial inputs.
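The following hedged PyTorch sketch shows how an inference-time intervention like JRS-Rem could be wired in as a forward hook that projects the jailbreak direction out of a chosen layer's hidden states. The hook-based approach, the layer index, and the module path in the usage comment are hypothetical; the paper's method removes the image-induced component specifically, which would additionally require a paired text-only forward pass.

```python
import torch


def make_jrs_removal_hook(direction: torch.Tensor, strength: float = 1.0):
    """Build a forward hook that projects `direction` out of a layer's output.

    `direction` is assumed unit-norm with shape (d,). Note this ablates the
    full component along the direction; the paper's JRS-Rem removes only the
    image-induced part of the shift, which this sketch approximates.
    """
    def hook(module, inputs, output):
        # HF-style decoder layers return a tuple whose first element is (B, T, d).
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction  # component along direction
        cleaned = hidden - strength * proj
        if isinstance(output, tuple):
            return (cleaned,) + output[1:]
        return cleaned
    return hook


# Usage (hypothetical layer index and module path for an HF-style VLM backbone):
# handle = model.language_model.model.layers[20].register_forward_hook(
#     make_jrs_removal_hook(direction.to(model.device)))
# ... run generation with the hook active ...
# handle.remove()
```

Registering the hook only at generation time keeps the base weights untouched, which is consistent with the paper's framing of JRS-Rem as an inference-time defense.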

💡 Why This Paper Matters

This paper is relevant because it not only deepens the understanding of vulnerabilities in multimodal AI systems but also offers a practical solution to enhance their safety. By addressing the representation shifts that lead to jailbreaks, the proposed method shows promise in maintaining model utility while improving alignment and safety, paving the way for more robust AI applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of interest because it tackles a critical challenge in the field: keeping vision-language models safe and robust against manipulation and harmful inputs. The findings offer a novel approach to quantifying and mitigating jailbreak risks, contributing to the development of secure AI systems. The techniques described could also inspire further research into vulnerabilities and defenses in other multimodal systems, making this a significant addition to the AI safety literature.

📚 Read the Full Paper