Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

Authors: Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma

Published: 2025-08-03

arXiv ID: 2508.01741v1

Added to Library: 2025-08-14 23:00 UTC

Red Teaming

📄 Abstract

Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target's weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder's parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.
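
To make the FTS idea concrete, below is a minimal, hypothetical sketch of how one might optimize a single adversarial image against an ensemble of simulated fine-tuned copies of a base vision encoder. It is not the paper's implementation: the toy encoder, the Gaussian parameter perturbation used to stand in for a fine-tuning trajectory, the embedding-matching loss, and all hyperparameters (eps, alpha, noise_scale, n_copies) are illustrative assumptions. In the actual attack the objective would be defined through the full VLM to elicit the adversarially targeted response, with TPG applied on the text side.

```python
# Hedged sketch of an FTS-style ensemble PGD attack on a toy vision encoder.
# All components below are placeholders chosen for illustration only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionEncoder(nn.Module):
    """Stand-in for a VLM vision encoder (placeholder architecture)."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.head = nn.Linear(8 * 16 * 16, dim)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.head(h.flatten(1))

def simulate_finetuned_copy(base, noise_scale=1e-3):
    """Crudely simulate a fine-tuning trajectory by adding small Gaussian
    shifts to the base encoder's parameters (assumed stand-in for FTS)."""
    sim = copy.deepcopy(base)
    with torch.no_grad():
        for p in sim.parameters():
            p.add_(noise_scale * torch.randn_like(p))
    for p in sim.parameters():
        p.requires_grad_(False)
    return sim

def fts_pgd(base, image, target_embedding, steps=50,
            eps=8 / 255, alpha=1 / 255, n_copies=4):
    """PGD on the image, averaging the loss over simulated fine-tuned encoders
    so the perturbation transfers across nearby parameter shifts."""
    ensemble = [simulate_finetuned_copy(base) for _ in range(n_copies)]
    adv = image.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = sum(F.mse_loss(enc(adv), target_embedding)
                   for enc in ensemble) / n_copies
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()               # step toward the target
            adv = image + (adv - image).clamp(-eps, eps)  # stay inside the eps-ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()

if __name__ == "__main__":
    base_encoder = ToyVisionEncoder()
    for p in base_encoder.parameters():
        p.requires_grad_(False)
    clean = torch.rand(1, 3, 32, 32)
    target = torch.randn(1, 64)  # embedding the attacker wants to induce
    adv_image = fts_pgd(base_encoder, clean, target)
    print("perturbation L_inf:", (adv_image - clean).abs().max().item())
```

The design point this sketch is meant to convey is the one the abstract emphasizes: plain PGD against the frozen base encoder tends not to survive fine-tuning, whereas averaging the objective over plausible parameter shifts pushes the perturbation toward regions that remain effective across fine-tuned variants.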

🔍 Key Points

  • The paper introduces the Simulated Ensemble Attack (SEA), a novel grey-box method for executing transferable jailbreak attacks against fine-tuned Vision-Language Models (VLMs) without knowledge of their parameters.
  • SEA combines two innovative techniques: Fine-tuning Trajectory Simulation (FTS) for generating robust adversarial images by perturbing the vision encoder, and Targeted Prompt Guidance (TPG) for steering the language decoder toward desired harmful outputs.
  • Experiments demonstrate SEA's high transferability, achieving attack success rates over 86.5% and toxicity rates around 49.5% across various fine-tuned VLMs, including those trained specifically to enhance safety.
  • The study reveals that inherited vulnerabilities from base VLMs can significantly compromise the effectiveness of fine-tuning strategies aimed at improving safety, thereby exposing critical security gaps in VLM deployment.
  • The results highlight the urgency for the development of defense mechanisms that address transferable vulnerabilities during the entire lifecycle of model training and fine-tuning.

💡 Why This Paper Matters

This paper is crucial because it uncovers vulnerabilities in fine-tuned Vision-Language Models, providing empirical evidence that open-source base models can serve as gateways for adversarial attacks against their fine-tuned derivatives. By introducing SEA, the research not only offers a novel method to exploit these vulnerabilities but also carries significant implications for the safe deployment of AI systems in real-world applications. The findings emphasize the need for stronger security measures throughout AI model development, particularly in safety-critical domains.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting due to its focus on the grey-box threat model, which reflects a realistic scenario in which attackers leverage full access to a publicly available base model to compromise proprietary fine-tuned derivatives. The techniques proposed in SEA challenge the assumption that fine-tuning, including safety-oriented fine-tuning, erases base-model vulnerabilities, making the study relevant for developing both offensive and defensive techniques in AI security.

📚 Read the Full Paper