
VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

Authors: Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei

Published: 2025-12-05

arXiv ID: 2512.05853v1

Added to Library: 2025-12-08 03:00 UTC

Red Teaming

📄 Abstract

Multimodal Large Language Models (MLLMs) are widely used across many fields due to their powerful cross-modal comprehension and generation capabilities. However, additional modalities introduce additional vulnerabilities that can be exploited for jailbreak attacks, which induce MLLMs to output harmful content. Because of the strong reasoning ability of MLLMs, previous jailbreak attacks have explored reasoning safety risks in the text modality, while similar threats in the visual modality have been largely overlooked. To fully evaluate potential safety risks in visual reasoning tasks, we propose the Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate the complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement, which optimizes the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated images, we propose Semantic Coherent Completion, which iteratively rewrites each sub-text using contextual information within this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency between each sub-text and its generated image. A series of experiments demonstrates that VRSA achieves a higher attack success rate than state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs, including GPT-4o and Claude-4.5-Sonnet.

🔍 Key Points

  • Introduction of the Visual Reasoning Sequential Attack (VRSA), a novel method for conducting jailbreak attacks on Multimodal Large Language Models (MLLMs) by decomposing harmful intentions into sequential sub-images.
  • Development of Adaptive Scene Refinement to enhance the rationality of visual scenes relevant to harmful intents, which improves model understanding during the attack.
  • Semantic Coherent Completion is proposed to ensure continuity in the text narratives guiding the image sequences, enhancing the logical flow and consistency of the attack.
  • Text-Image Consistency Alignment ensures high semantic similarity between each generated image and its corresponding sub-text, boosting the attack's effectiveness (a minimal illustrative sketch follows this list).
  • Extensive experiments demonstrating that VRSA significantly outperforms existing state-of-the-art jailbreak methods on both open-source and closed-source MLLMs.
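To make the Text-Image Consistency Alignment idea concrete, below is a minimal sketch of a generic text-image consistency check. This is not the authors' implementation: it assumes an off-the-shelf CLIP model (`openai/clip-vit-base-patch32` via the `transformers` library), and the helper names `text_image_similarity` and `is_consistent` as well as the 0.28 threshold are illustrative assumptions rather than values from the paper. In a VRSA-style evaluation, a generated sub-image that fails such a check would be regenerated or its sub-text rewritten before proceeding.

```python
# Minimal sketch of a text-image consistency check in the spirit of
# Text-Image Consistency Alignment. NOT the authors' implementation:
# it assumes an off-the-shelf CLIP model and an arbitrary threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def text_image_similarity(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a caption."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # Normalize so the dot product equals cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())


def is_consistent(image: Image.Image, text: str, threshold: float = 0.28) -> bool:
    """Check whether a generated image stays semantically aligned with its text.

    The 0.28 threshold is an illustrative assumption, not a value from the paper.
    """
    return text_image_similarity(image, text) >= threshold
```

The same similarity score is useful on the defensive side: a moderation pipeline can flag image sequences whose individual frames align suspiciously well with fragments of a reconstructed harmful instruction.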

💡 Why This Paper Matters

The VRSA method represents a significant advance in understanding and exploiting vulnerabilities in multimodal models, particularly the visual reasoning capabilities overlooked by previous research. The paper's contributions highlight safety risks that grow with the increasing complexity of AI systems and underscore the need for stronger security measures in MLLMs, which are becoming integral to a wide range of applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers: it provides insight into specific vulnerabilities of MLLMs through a sophisticated attack method and illustrates the broader implications of deploying multimodal AI in sensitive domains. The findings and proposed methodology open new avenues for research into securing AI models against emerging hybrid attack vectors that exploit the interplay between visual and textual inputs.

📚 Read the Full Paper