
Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

Authors: Yuxuan Zhou, Yuzhao Peng, Yang Bai, Kuofeng Gao, Yihao Zhang, Yechao Zhang, Xun Chen, Tao Yu, Tao Dai, Shu-Tao Xia

Published: 2025-11-11

arXiv ID: 2511.08367v1

Added to Library: 2025-11-14 23:02 UTC

Red Teaming

📄 Abstract

Large Vision-Language Models (VLMs) are susceptible to jailbreak attacks: researchers have developed a variety of attack strategies that can successfully bypass the safety mechanisms of VLMs. Among these approaches, jailbreak methods based on the Out-of-Distribution (OOD) strategy have garnered widespread attention due to their simplicity and effectiveness. This paper further advances the in-depth understanding of OOD-based VLM jailbreak methods. Experimental results demonstrate that jailbreak samples generated via mild OOD strategies exhibit superior performance in circumventing the safety constraints of VLMs, a phenomenon we define as "weak-OOD". To unravel the underlying causes of this phenomenon, this study takes SI-Attack, a typical OOD-based jailbreak method, as the research object. We attribute this phenomenon to a trade-off between two dominant factors: input intent perception and model refusal triggering. The inconsistency in how these two factors respond to OOD manipulations gives rise to this phenomenon. Furthermore, we provide a theoretical argument for the inevitability of such inconsistency from the perspective of discrepancies between model pre-training and alignment processes. Building on the above insights, we draw inspiration from optical character recognition (OCR) capability enhancement, a core task in the pre-training phase of mainstream VLMs. Leveraging this capability, we design a simple yet highly effective VLM jailbreak method whose performance outperforms that of SOTA baselines.
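To make the notion of OOD strength concrete, the sketch below applies an SI-Attack-style patch shuffle with a tunable ratio: permuting only a fraction of the image patches approximates the mild, "weak-OOD" regime, while permuting all of them approximates a strong OOD manipulation. This is an illustrative reconstruction, not the paper's code; the `patch_shuffle` helper, the grid size, and the `shuffle_ratio` parameter are assumptions, and the example operates on an arbitrary benign image.

```python
import random
from PIL import Image

def patch_shuffle(img: Image.Image, grid: int = 4,
                  shuffle_ratio: float = 0.25, seed: int = 0) -> Image.Image:
    """Split `img` into a grid x grid mosaic and permute a `shuffle_ratio`
    fraction of the patches among themselves; the rest stay in place."""
    rng = random.Random(seed)
    w, h = img.size
    pw, ph = w // grid, h // grid
    boxes = [(c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)
             for r in range(grid) for c in range(grid)]
    patches = [img.crop(b) for b in boxes]

    # Choose which patch positions take part in the shuffle.
    k = int(round(shuffle_ratio * len(boxes)))
    if k < 2:                      # nothing meaningful to permute
        return img.copy()
    idx = rng.sample(range(len(boxes)), k)
    permuted = idx[:]
    rng.shuffle(permuted)

    # Paste each selected patch at its permuted position.
    out = img.copy()
    for src, dst in zip(idx, permuted):
        out.paste(patches[src], boxes[dst][:2])
    return out

# Mild ("weak-OOD") vs. strong manipulation of the same benign input:
# weak   = patch_shuffle(Image.open("example.png"), shuffle_ratio=0.25)
# strong = patch_shuffle(Image.open("example.png"), shuffle_ratio=1.0)
```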

🔍 Key Points

  • Identification of the 'weak-OOD' phenomenon: jailbreak samples produced by mild OOD manipulations bypass VLM safety constraints more effectively than heavily perturbed ones.
  • In-depth analysis of the trade-off between input intent perception and model refusal triggering as the key factors behind effective jailbreak attacks.
  • Proposal of JOCR, a novel VLM jailbreak method that leverages OCR robustness and outperforms existing state-of-the-art methods.
  • Experimental validation through extensive ablation studies of how different perturbation strategies affect jailbreak success against model defenses.
  • A theoretical framework linking the gap between VLM pre-training and alignment to these security vulnerabilities.

💡 Why This Paper Matters

This paper significantly advances the understanding of jailbreak attacks on Vision-Language Models by dissecting the mechanics behind the 'weak-OOD' phenomenon and contributing a novel jailbreak method (JOCR) that outperforms existing attacks. Its findings pinpoint where the mismatch between pre-training and alignment leaves models exposed, indicating where improvements to training and alignment can bolster defenses against adversarial attacks.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers because it exposes pressing security vulnerabilities in state-of-the-art VLMs. By elucidating the mechanics of jailbreak attacks and tracing them to the gap between pre-training and alignment, it contributes vital knowledge for developing more robust and secure AI systems against emerging threats.
