
Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Authors: Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao

Published: 2025-08-11

arXiv ID: 2508.09218v1

Added to Library: 2025-08-14 23:14 UTC

Red Teaming

📄 Abstract

Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as "successful" are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. Compared to previous methods, it improves success rates by 67% and harmfulness by 21%, revealing a previously underappreciated weakness in current multimodal safety systems.
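
To make the four-axis framing concrete, here is a minimal sketch of how an evaluation record along these axes could be represented in code. The axis names (on-topicness, OOD-intensity, harmfulness, refusal rate) come from the paper; the [0, 1] scoring scale, the thresholds, and the `is_effective` rule are illustrative assumptions, not the paper's actual judging pipeline.

```python
# Minimal sketch of a four-axis jailbreak evaluation record.
# Assumption: each axis is scored in [0, 1] by a separate judge; the thresholds
# and the effectiveness rule below are illustrative, not the paper's pipeline.
from dataclasses import dataclass


@dataclass
class JailbreakEvaluation:
    on_topicness: float    # how closely the prompt tracks the malicious goal
    ood_intensity: float   # how far the prompt drifts from the training distribution
    harmfulness: float     # how harmful the model's output actually is
    refusal_rate: float    # fraction of attempts the model refuses outright

    def is_effective(self, harm_threshold: float = 0.5,
                     refusal_threshold: float = 0.5) -> bool:
        """Count an attack as 'truly effective' only when the output is both
        harmful and not refused, rather than merely non-refused."""
        return (self.harmfulness >= harm_threshold
                and self.refusal_rate <= refusal_threshold)


# Example: a prompt that balances relevance and novelty, per the paper's trade-off.
balanced = JailbreakEvaluation(on_topicness=0.7, ood_intensity=0.6,
                               harmfulness=0.8, refusal_rate=0.1)
print(balanced.is_effective())  # True under these illustrative thresholds
```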

🔍 Key Points

  • Introduction of a four-axis evaluation framework for assessing jailbreak effectiveness against MLLMs (on-topicness, OOD-intensity, harmfulness, refusal rate).
  • Development of the Balanced Structural Decomposition (BSD) method, which decomposes malicious prompts into semantically aligned sub-tasks, balancing on-topicness and OOD signals to evade safety filters (a minimal structural sketch follows after this list).
  • Extensive empirical studies across 13 MLLMs, demonstrating that BSD improves attack success rates by 67% and output harmfulness by 21% over previous techniques.
  • Identification of a structural trade-off: prompts that balance relevance and novelty are more likely to evade detection and yield harmful outputs, exposing vulnerabilities in current safety mechanisms.
  • A comprehensive analysis linking input characteristics (on-topicness and OOD-intensity) with output outcomes (harmfulness and refusal rate) to deepen understanding of multimodal model behavior.
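
As a rough illustration of the decomposition idea referenced above, the sketch below shows a recursive rewrite loop that keeps splitting a goal until each sub-task lands in a band where on-topicness and OOD-intensity are balanced. The `decompose_fn` and `score_fn` callables, the target band, and the recursion depth are hypothetical placeholders; the paper's actual BSD heuristics and visual-cue injection are not reproduced here.

```python
# Illustrative skeleton of a BSD-style recursive rewrite loop.
# Assumptions: the caller supplies `decompose_fn` (splits a goal into
# semantically aligned sub-tasks) and `score_fn` (returns on-topicness and
# OOD-intensity in [0, 1]); the target band and depth limit are placeholders.
from typing import Callable, List, Tuple


def bsd_rewrite(goal: str,
                decompose_fn: Callable[[str], List[str]],
                score_fn: Callable[[str], Tuple[float, float]],
                target_band: Tuple[float, float] = (0.5, 0.8),
                max_depth: int = 3,
                depth: int = 0) -> List[str]:
    """Recursively split a prompt until each sub-task sits in the band where
    on-topicness and OOD-intensity are balanced, or the depth limit is hit."""
    on_topic, ood = score_fn(goal)
    lo, hi = target_band
    balanced = lo <= on_topic <= hi and lo <= ood <= hi
    if balanced or depth >= max_depth:
        return [goal]
    sub_tasks: List[str] = []
    for sub in decompose_fn(goal):
        sub_tasks.extend(bsd_rewrite(sub, decompose_fn, score_fn,
                                     target_band, max_depth, depth + 1))
    return sub_tasks


# Toy usage with placeholder splitting and scoring (not the paper's judges):
subs = bsd_rewrite(
    "placeholder goal text",
    decompose_fn=lambda g: [g + " / sub-task A", g + " / sub-task B"],
    score_fn=lambda g: (0.9, 0.2),  # always too on-topic, too in-distribution
)
print(len(subs))  # 8: recursion bottoms out at max_depth under this toy scorer
```

Under this toy scorer the recursion always bottoms out at the depth limit, so the component that matters is the scorer: in the paper's framing, sub-tasks that sit in the balanced band are the ones most likely to slip past safety filters while still eliciting harmful content.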

💡 Why This Paper Matters

This paper significantly advances our understanding of how multimodal large language models (MLLMs) can be compromised by adversarial prompts. By proposing the Balanced Structural Decomposition (BSD) attack together with a four-axis evaluation framework, it exposes critical weaknesses in current safety mechanisms and offers practical insights for designing more effective defenses. The findings underscore that hardening victim models against sophisticated, structurally decomposed attack vectors matters as much as improving their task performance, making this a pivotal resource in AI safety research.

🎯 Why It's Interesting for AI Security Researchers

This paper is of considerable interest to AI security researchers because it presents a systematic method for evaluating the effectiveness of jailbreak attacks on MLLMs. The BSD method and the four-axis evaluation framework directly address the problem of overestimated jailbreak success in current evaluations, revealing significant insights into the trade-offs between input characteristics and output responses. Understanding these relationships is crucial for developing robust countermeasures against adversarial attacks in multimodal systems, which are increasingly prevalent in AI applications.

📚 Read the Full Paper