Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Authors: Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma

Published: 2025-08-03

arXiv ID: 2508.01741v1

Added to Library: 2025-08-14 23:00 UTC

Red Teaming

📄 Abstract

Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target's weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder's parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.
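
To make the FTS idea concrete, below is a minimal, hypothetical sketch of how one might optimize a single adversarial image against an ensemble of simulated fine-tuned copies of a base vision encoder. It is not the paper's implementation: the toy encoder, the Gaussian parameter perturbation used to stand in for a fine-tuning trajectory, the embedding-matching loss, and all hyperparameters (eps, alpha, noise_scale, n_copies) are illustrative assumptions. In the actual attack the objective would be defined through the full VLM to elicit the adversarially targeted response, with TPG applied on the text side.

```python
# Hedged sketch of an FTS-style ensemble PGD attack on a toy vision encoder.
# All components below are placeholders chosen for illustration only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionEncoder(nn.Module):
    """Stand-in for a VLM vision encoder (placeholder architecture)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.head = nn.Linear(8 * 16 * 16, dim)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.head(h.flatten(1))

def simulate_finetuned_copy(base, noise_scale=1e-3):
    """Crudely simulate a fine-tuning trajectory by adding small Gaussian
    shifts to the base encoder's parameters (assumed stand-in for FTS)."""
    sim = copy.deepcopy(base)
    with torch.no_grad():
        for p in sim.parameters():
            p.add_(noise_scale * torch.randn_like(p))
    for p in sim.parameters():
        p.requires_grad_(False)
    return sim

def fts_pgd(base, image, target_embedding, steps=50,
            eps=8 / 255, alpha=1 / 255, n_copies=4):
    """PGD on the image, averaging the loss over simulated fine-tuned encoders
    so the perturbation transfers across nearby parameter shifts."""
    ensemble = [simulate_finetuned_copy(base) for _ in range(n_copies)]
    adv = image.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = sum(F.mse_loss(enc(adv), target_embedding)
                   for enc in ensemble) / n_copies
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # step toward the target
            adv = image + (adv - image).clamp(-eps, eps)  # stay inside the eps-ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()

if __name__ == "__main__":
    base_encoder = ToyVisionEncoder()
    for p in base_encoder.parameters():
        p.requires_grad_(False)
    clean = torch.rand(1, 3, 32, 32)
    target = torch.randn(1, 64)  # embedding the attacker wants to induce
    adv_image = fts_pgd(base_encoder, clean, target)
    print("perturbation L_inf:", (adv_image - clean).abs().max().item())
```

The design point this sketch is meant to convey is the one the abstract emphasizes: plain PGD against the frozen base encoder tends not to survive fine-tuning, whereas averaging the objective over plausible parameter shifts pushes the perturbation toward regions that remain effective across fine-tuned variants.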

🔍 Key Points

  • The paper introduces the Simulated Ensemble Attack (SEA), a novel grey-box method for executing transferable jailbreak attacks against fine-tuned Vision-Language Models (VLMs) without knowledge of their parameters.
  • SEA combines two innovative techniques: Fine-tuning Trajectory Simulation (FTS) for generating robust adversarial images by perturbing the vision encoder, and Targeted Prompt Guidance (TPG) for steering the language decoder toward desired harmful outputs.
  • Experiments demonstrate SEA's high transferability, achieving attack success rates over 86.5% and toxicity rates around 49.5% across various fine-tuned VLMs, including those trained specifically to enhance safety.
  • The study reveals that inherited vulnerabilities from base VLMs can significantly compromise the effectiveness of fine-tuning strategies aimed at improving safety, thereby exposing critical security gaps in VLM deployment.
  • The results highlight the urgency for the development of defense mechanisms that address transferable vulnerabilities during the entire lifecycle of model training and fine-tuning.

💡 Why This Paper Matters

This paper is crucial because it uncovers vulnerabilities in fine-tuned Vision-Language Models, providing empirical evidence that open-source base models can serve as gateways for adversarial attacks against their fine-tuned derivatives. By introducing SEA, the research not only offers a novel method to exploit these vulnerabilities but also carries significant implications for the safe deployment of AI systems in real-world applications. The findings emphasize the need for stronger security measures throughout AI model development, particularly in safety-critical domains.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting due to its focus on the grey-box threat model, which reflects a realistic scenario in which attackers leverage full access to a publicly available base model to compromise proprietary fine-tuned derivatives. The techniques proposed in SEA challenge the assumption that fine-tuning, including safety-oriented fine-tuning, erases base-model vulnerabilities, making the study relevant for developing both offensive and defensive techniques in AI security.

📚 Read the Full Paper