FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Authors: Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu

Published: 2025-09-25

arXiv ID: 2509.21029v2

Added to Library: 2025-09-29 01:01 UTC

Red Teaming

📄 Abstract

The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
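
The sharpness finding described above is straightforward to probe empirically. Below is a minimal sketch of such a probe, not code from the paper: it nudges a white-box model's parameters with small Gaussian noise and measures how much the attack loss rises, with a small rise indicating a flat (and hence more transferable) region. Here `model`, `loss_fn`, `adv_image`, and `prompt` are assumed placeholders for an open-source MLLM, a jailbreak objective (e.g., negative log-likelihood of a target response), an optimized adversarial image, and a harmful request.

```python
import torch

@torch.no_grad()
def sharpness_probe(model, loss_fn, adv_image, prompt, sigma=1e-3, n_samples=8):
    """Estimate how sharply the attack loss rises when the victim model's
    parameters are perturbed by Gaussian noise; attacks sitting in flat
    regions should show only a small average rise."""
    base = loss_fn(model, adv_image, prompt).item()
    rises = []
    for _ in range(n_samples):
        # Perturb every parameter in place: theta -> theta + sigma * eps.
        noise = [sigma * torch.randn_like(p) for p in model.parameters()]
        for p, n in zip(model.parameters(), noise):
            p.add_(n)
        rises.append(loss_fn(model, adv_image, prompt).item() - base)
        # Restore the original weights before drawing the next sample.
        for p, n in zip(model.parameters(), noise):
            p.sub_(n)
    return sum(rises) / n_samples
```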

🔍 Key Points

  • The study finds that visual jailbreaking attacks tend to reside in high-sharpness regions of the loss landscape, which explains their limited transferability across different multimodal large language models (MLLMs).
  • The paper introduces a novel method called Feature Over-Reliance CorrEction (FORCE), designed to mitigate the attacks' dependency on model-specific early-layer features and high-frequency spectral components.
  • FORCE improves the cross-model transferability of visual jailbreaking attacks by guiding them to explore broader feasible regions across layer representations and by rescaling frequency features according to their semantic relevance (a hedged sketch of both corrections follows this list).
  • Extensive experiments demonstrate the effectiveness of FORCE across various MLLM architectures and datasets, showing significant improvements in attack success rates and reductions in query costs.
  • The findings underscore the importance of addressing vulnerabilities in MLLMs and contribute to advancing red-teaming methodologies in AI security.
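
The two corrections lend themselves to a compact illustration. The following PyTorch sketch is a hedged reconstruction, not the authors' released implementation: `layer_feats` and `layer_targets` stand for intermediate activations (e.g., collected via forward hooks) and their attack targets, and the concrete frequency-weighting rule is an assumption standing in for the paper's semantic rescaling.

```python
import torch
import torch.nn.functional as F

def spectral_rescale(grad, clean_image, alpha=1.0):
    """Reweight the attack gradient in the frequency domain, boosting
    components where the clean image carries energy (mostly low,
    semantically rich frequencies) and damping the rest. This weighting
    rule is an illustrative assumption, not the paper's exact formula."""
    g_hat = torch.fft.fft2(grad)
    ref = torch.fft.fft2(clean_image).abs()
    weight = (ref / ref.amax(dim=(-2, -1), keepdim=True)) ** alpha  # in [0, 1]
    return torch.fft.ifft2(g_hat * weight).real

def multilayer_objective(layer_feats, layer_targets):
    """Spread the attack objective over several intermediate layers instead
    of letting it over-fit one model-specific early-layer representation."""
    losses = [F.mse_loss(f, t) for f, t in zip(layer_feats, layer_targets)]
    return sum(losses) / len(losses)

def force_style_step(adv, grad, clean, lr=1 / 255, eps=8 / 255):
    """One sign-gradient update using the rescaled gradient, projected back
    onto an L_inf ball of radius eps around the clean image."""
    adv = adv - lr * spectral_rescale(grad, clean).sign()
    return (clean + (adv - clean).clamp(-eps, eps)).clamp(0, 1)
```

Averaging the objective over several layers discourages the perturbation from latching onto any single model-specific representation, while the spectral weighting steers each update toward semantically meaningful frequency components rather than brittle high-frequency detail.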

💡 Why This Paper Matters

This paper presents crucial insights into the vulnerabilities of multimodal large language models and introduces an innovative approach, FORCE, which enhances the transferability of visual jailbreaking attacks. The results not only contribute to the theoretical understanding of these models' weaknesses but also provide practical methodologies for conducting effective red-teaming evaluations. Given the increasing deployment of MLLMs in sensitive applications, this research is vital for improving model security and safety.

🎯 Why It's Interesting for AI Security Researchers

This paper addresses a significant and emerging concern for AI security researchers: the vulnerability of multimodal large language models to visual jailbreaking attacks. The introduction of the FORCE method provides a new avenue for enhancing attack transferability, which is essential for evaluating the robustness of these models against adversarial manipulation. Researchers focused on AI safety and security can leverage the findings and methodologies outlined in this work to better understand and mitigate the risks of deploying powerful MLLMs in real-world applications.

📚 Read the Full Paper: https://arxiv.org/abs/2509.21029