Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models

Authors: Palash Nandi, Maithili Joshi, Tanmoy Chakraborty

Published: 2025-07-18

arXiv ID: 2507.13761v1

Added to Library: 2025-07-21 04:00 UTC

Red Teaming

📄 Abstract

Language models are highly sensitive to prompt formulations - small changes in input can drastically alter their output. This raises a critical question: To what extent can prompt sensitivity be exploited to generate inapt content? In this paper, we investigate how discrete components of prompt design influence the generation of inappropriate content in Visual Language Models (VLMs). Specifically, we analyze the impact of three key factors on successful jailbreaks: (a) the inclusion of detailed visual information, (b) the presence of adversarial examples, and (c) the use of positively framed beginning phrases. Our findings reveal that while a VLM can reliably distinguish between benign and harmful inputs in unimodal settings (text-only or image-only), this ability significantly degrades in multimodal contexts. Each of the three factors is independently capable of triggering a jailbreak, and we show that even a small number of in-context examples (as few as three) can push the model toward generating inappropriate outputs. Furthermore, we propose a framework that utilizes a skip-connection between two internal layers of the VLM, which substantially increases jailbreak success rates, even when using benign images. Finally, we demonstrate that memes, often perceived as humorous or harmless, can be as effective as toxic visuals in eliciting harmful content, underscoring the subtle and complex vulnerabilities of VLMs.

🔍 Key Points

  • This paper introduces SKIP-CON, a methodology that inserts skip-connections between internal layers of Visual Language Models (VLMs) to boost jailbreak success rates, reporting increases of up to 181% on some models (a minimal, illustrative sketch of the skip-connection idea follows this list).
  • The study highlights the critical role of prompt-design elements, specifically detailed visual information, adversarial examples, and positively framed opening phrases, in driving the generation of harmful content by VLMs.
  • Extensive experiments show that VLMs distinguish benign from harmful inputs reliably in unimodal contexts but struggle to do so in multimodal settings, exposing vulnerabilities in current VLM architectures.
  • The research discusses how even benign images can trigger inappropriate outputs when combined with certain prompt elements, illustrating the nuanced and complex vulnerabilities of VLMs to adversarial attacks like jailbreaking.
  • The paper's dataset and methodologies are made publicly available, enabling further research and exploration in the area of VLM safety and adversarial robustness.
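
The sketch below illustrates, in hypothetical form, the skip-connection idea referenced in the first key point: routing the hidden state produced by one internal decoder layer into the output of a later layer at inference time, using forward hooks so the model weights are untouched. The TinyDecoder stand-in model, the layer indices, and the scaling factor alpha are assumptions made for illustration; this is not the authors' SKIP-CON implementation.

```python
# Minimal, hypothetical sketch of a cross-layer skip connection in a
# transformer-style decoder, implemented with PyTorch forward hooks.
# All module names, layer indices, and the alpha coefficient are assumptions.
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Stand-in for one decoder layer of a VLM's language model."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.ff(x))


class TinyDecoder(nn.Module):
    """Stand-in for the stack of internal layers inside a VLM."""
    def __init__(self, dim: int = 64, depth: int = 8):
        super().__init__()
        self.layers = nn.ModuleList([TinyBlock(dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


def add_skip_connection(model: TinyDecoder, src: int, dst: int, alpha: float = 1.0):
    """Route the hidden state leaving layer `src` into the output of layer `dst`.

    Uses forward hooks, so the original weights stay untouched; `alpha`
    scales the injected activation and is an arbitrary illustrative value.
    """
    cache = {}

    def capture(module, inputs, output):
        # Store the activation produced by the source layer on this pass.
        cache["h"] = output.detach()

    def inject(module, inputs, output):
        # Returning a tensor from a forward hook replaces the layer's output.
        if "h" in cache:
            return output + alpha * cache["h"]
        return output

    handles = [
        model.layers[src].register_forward_hook(capture),
        model.layers[dst].register_forward_hook(inject),
    ]
    return handles  # call .remove() on each handle to restore the model


if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyDecoder()
    x = torch.randn(1, 16, 64)  # (batch, sequence length, hidden size)
    baseline = model(x)
    handles = add_skip_connection(model, src=2, dst=6)
    patched = model(x)
    print("max |difference|:", (patched - baseline).abs().max().item())
    for h in handles:
        h.remove()
```

Hooks are used here so the intervention can be attached or removed at inference time without retraining; how SKIP-CON actually selects the layer pair and combines the activations is described in the paper itself.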

💡 Why This Paper Matters

This paper addresses the growing concern of prompt sensitivity and adversarial attacks in Visual Language Models, an issue central to deploying such models safely. The SKIP-CON methodology and the findings on prompt-design vulnerabilities offer concrete insights for building more robust and secure AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of great interest because it identifies specific vulnerabilities in VLMs, which are increasingly being deployed in sensitive applications. Understanding how adversarial inputs can manipulate these models is vital for developing better security measures and improving the reliability of AI systems in real-world scenarios.

📚 Read the Full Paper