PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

Authors: Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao, Yakai Li, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang

Published: 2025-07-29

arXiv ID: 2507.21540v1

Added to Library: 2025-07-30 05:02 UTC

Red Teaming

📄 Abstract

The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to sophisticated adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output. This makes the malicious intent emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.
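
For orientation, the sketch below restates the pipeline described above in code form. It assumes a generic LVLM client exposing a generate(text=..., images=...) method; every name in it (VisualGadget, decompose_into_gadgets, build_orchestrating_prompt, run_pipeline) is a hypothetical placeholder rather than the authors' implementation, and the decomposition step is deliberately left as an unimplemented stub.

```python
# Conceptual restatement of the ROP-style pipeline described in the abstract.
# All names are hypothetical placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class VisualGadget:
    """One individually benign visual fragment of the decomposed instruction."""
    caption: str        # harmless when read on its own
    image_bytes: bytes  # rendered image carrying this fragment


def decompose_into_gadgets(instruction: str) -> List[VisualGadget]:
    """Split an instruction into fragments that look benign in isolation.

    Deliberately left unimplemented: how this decomposition is constructed
    is the paper's contribution and is not reproduced here.
    """
    raise NotImplementedError


def build_orchestrating_prompt(num_gadgets: int) -> str:
    """Textual 'program' that fixes the order in which the model reads and
    composes the gadget images during its reasoning."""
    steps = "\n".join(
        f"Step {i}: use the content of image {i}." for i in range(1, num_gadgets + 1)
    )
    return "Carry out the following steps and merge the results into one answer:\n" + steps


def run_pipeline(instruction: str, lvlm_client) -> str:
    """Assemble gadgets and the orchestrating prompt, then query the model."""
    gadgets = decompose_into_gadgets(instruction)
    prompt = build_orchestrating_prompt(len(gadgets))
    # No single input carries the intent; it only emerges when the model
    # composes the gadgets over multiple reasoning steps.
    return lvlm_client.generate(text=prompt, images=[g.image_bytes for g in gadgets])
```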

πŸ” Key Points

  • Introduction of PRISM, a multimodal jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security, demonstrating a novel way to exploit compositional reasoning in LVLMs.
  • PRISM decomposes a harmful instruction into individually benign visual gadgets and orchestrates their composition through a carefully engineered textual prompt, so the malicious intent emerges only during the model's multi-step reasoning and is hard to detect from any single component.
  • Extensive experiments show that PRISM achieves substantially higher attack success rates (ASR over 0.90 on SafeBench) against state-of-the-art LVLMs, both open-source and commercial; an illustrative ASR computation is sketched after this list.
  • Evaluation against multiple benchmark datasets (SafeBench and MM-SafetyBench) confirms the robustness and generalizability of the PRISM method across various models and categories of unsafe outputs.
  • Demonstration of PRISM's resilience to existing defense mechanisms, highlighting critical vulnerabilities in current LVLM safety alignment practices.
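
The ASR figures above are attack success rates: the fraction of benchmark prompts for which an attack elicits a response judged harmful. Below is a minimal scoring sketch, assuming hypothetical attack and judge_is_harmful callables; it is not the paper's evaluation harness.

```python
# Minimal ASR-scoring sketch (not the authors' evaluation code).
# `attack` and `judge_is_harmful` are hypothetical callables standing in for
# the jailbreak method under test and a safety judge, respectively.
from typing import Callable, Iterable


def attack_success_rate(
    prompts: Iterable[str],
    attack: Callable[[str], str],                  # prompt -> model response
    judge_is_harmful: Callable[[str, str], bool],  # (prompt, response) -> verdict
) -> float:
    """ASR = (# prompts whose response is judged harmful) / (# prompts)."""
    prompt_list = list(prompts)
    if not prompt_list:
        return 0.0
    successes = sum(judge_is_harmful(p, attack(p)) for p in prompt_list)
    return successes / len(prompt_list)
```

For example, on a 500-prompt benchmark an ASR of 0.90 corresponds to roughly 450 prompts whose responses are judged harmful.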

💡 Why This Paper Matters

The paper matters because it exposes critical vulnerabilities in large vision-language models (LVLMs) through a sophisticated jailbreak method. The findings underscore the urgent need for stronger safety mechanisms and highlight the risks advanced AI systems pose when steered into generating harmful content. The work also illustrates how subtle manipulation of a model's reasoning process can lead to significant security breaches, which is relevant for policymakers and for developers of AI safety measures.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of interest to AI security researchers because it addresses a pressing issue in AI safety and ethics: the susceptibility of advanced multimodal models to jailbreak attacks. It provides insights into potential attack vectors and highlights the weaknesses of current defenses, making it crucial for researchers focused on developing robust security measures and ensuring the safe deployment of AI technologies.
