
Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Authors: Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, Sujuan Qin

Published: 2026-01-22

arXiv ID: 2601.15698v1

Added to Library: 2026-01-23 03:01 UTC

Red Teaming

📄 Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing studies have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby inducing MLLMs to generate harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.

🔍 Key Points

  • Introduces Beyond Visual Safety (BVS), a novel jailbreaking framework aimed at probing the visual safety boundaries of Multimodal Large Language Models (MLLMs).
  • Employs a "reconstruction-then-generation" strategy that uses neutralized visual splicing and inductive recomposition to conceal malicious intent and induce harmful image generation.
  • Achieves a jailbreak success rate of 98.21% against GPT-5, significantly surpassing previous methods and highlighting weaknesses in current MLLM safety mechanisms.
  • Constructs a specialized benchmark dataset for rigorously testing MLLM visual safety, focused on high-severity categories that typically trigger refusal mechanisms.
  • Proposes the Multi-Image Distance Optimization Selection Algorithm (MIDOS) to improve patch selection when building semantically neutralized composite images (a hedged sketch of such a selection step follows this list).
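
The entry above names MIDOS only at a high level, so the snippet below is a minimal, hypothetical sketch of what a distance-based patch selection step could look like: for each grid cell, pick the candidate patch whose embedding is farthest from the sensitive target concept. The `embed` placeholder, the grid size, and the cosine-similarity scoring rule are illustrative assumptions, not the authors' actual algorithm.

```python
# Hypothetical sketch of a MIDOS-style patch selection step.
# NOT the paper's implementation: the embedding function, grid size,
# and distance-maximizing selection rule are illustrative assumptions.
import numpy as np


def embed(patch: np.ndarray) -> np.ndarray:
    """Placeholder for a visual encoder (e.g., a CLIP-like image embedder)."""
    rng = np.random.default_rng(abs(hash(patch.tobytes())) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)


def split_into_patches(image: np.ndarray, grid: int = 4):
    """Split an HxWxC image into a grid x grid list of patches (row-major)."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    return [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            for i in range(grid) for j in range(grid)]


def select_neutral_patches(candidate_images, target_embedding, grid: int = 4):
    """For each grid cell, keep the candidate patch whose embedding has the
    lowest cosine similarity to the sensitive target concept, so the spliced
    composite carries as little of that concept's semantics as possible."""
    per_image_patches = [split_into_patches(img, grid) for img in candidate_images]
    selected = []
    for cell in range(grid * grid):
        candidates = [patches[cell] for patches in per_image_patches]
        sims = [float(embed(p) @ target_embedding) for p in candidates]
        selected.append(candidates[int(np.argmin(sims))])  # farthest from target
    return selected


if __name__ == "__main__":
    # Toy usage: three random 64x64 RGB candidates and a random target vector.
    rng = np.random.default_rng(0)
    images = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(3)]
    target = rng.standard_normal(512)
    target /= np.linalg.norm(target)
    patches = select_neutral_patches(images, target)
    print(f"selected {len(patches)} patches of shape {patches[0].shape}")
```

Minimizing patch-to-target similarity is one plausible reading of "distance optimization selection"; the actual objective, constraints, and splicing procedure would come from the full paper.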

💡 Why This Paper Matters

This paper presents a significant advancement in understanding the security vulnerabilities of MLLMs, particularly concerning their visual safety mechanisms. By introducing the BVS framework, the authors expose critical weaknesses that can lead to the generation of harmful content, emphasizing the urgent need for improved safety measures in multimodal AI systems. The findings are timely, feeding directly into ongoing discussions in the AI ethics and safety communities about the potential for misuse of generative AI technologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers as it highlights novel attack vectors against MLLMs, demonstrating how existing safety mechanisms can be bypassed through sophisticated attacks that exploit cross-modal capabilities. The proposed methods and findings not only contribute to the body of knowledge on AI security but also catalyze further investigations into enhancing model safety and robustness against such vulnerabilities.

📚 Read the Full Paper