← Back to Library

Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models

Authors: Yunhan Zhao, Xiang Zheng, Xingjun Ma

Published: 2025-09-16

arXiv ID: 2509.12724v1

Added to Library: 2025-09-17 04:01 UTC

Red Teaming

📄 Abstract

Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.

🔍 Key Points

  • Defense2Attack proposes a novel bimodal jailbreak method for Vision-Language Models (VLMs) that integrates weak defenses into the attack pipeline, significantly improving both effectiveness and efficiency of jailbreaks.
  • The method includes three key components: a visual optimizer that uses positive semantics, a textual optimizer that disguises jailbreak intent with defense-styled prompts, and a suffix generator that fine-tunes outputs through reinforcement learning.
  • Defense2Attack achieves an approximately 80% attack success rate on open-source VLMs and 50% on commercial VLMs, outperforming existing methods that often require multiple attempts.
  • The technique demonstrates strong transferability across different VLMs and datasets, effectively extending its applicability beyond its training environment, which is crucial for real-world security threats.
  • The findings indicate a paradigm shift in how vulnerabilities in AI systems can be exploited, suggesting that defensive measures can paradoxically facilitate rather than hinder attacks.

💡 Why This Paper Matters

The paper presents significant advancements in the understanding and methodology of jailbreak attacks on Vision-Language Models. By introducing Defense2Attack, the authors not only address the shortcomings of prior methodologies by highlighting how weak defenses can be leveraged to create more effective attacks, but they also provide empirical evidence of its efficiency and transferability. This offers a fresh perspective on the safety mechanisms employed in VLMs and emphasizes the need for robust defenses against evolving attack strategies in AI.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it uncovers new vulnerabilities in Vision-Language Models and challenges existing assumptions about AI safety. The novel methodology presented could serve as a basis for further research into both offensive and defensive AI strategies, emphasizing the continuous arms race in AI security. Moreover, it provides critical insights into the complexities of multimodal AI systems and their potential exploits, prompting the need for more resilient AI architectures.

📚 Read the Full Paper