Jailbreaks on Vision Language Model via Multimodal Reasoning

Authors: Aarush Noheria, Yuguang Yao

Published: 2026-01-29

arXiv ID: 2601.22398v1

Added to Library: 2026-02-03 08:08 UTC

Red Teaming

📄 Abstract

Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase the attack success rate (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in the regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual-strategy approach significantly improves ASR while preserving naturalness in both the text and visual domains.

🔍 Key Points

  • Introduction of a novel dual-strategy jailbreak framework that combines prompt rewriting and adaptive noising, both built on the Chain-of-Thought (CoT) paradigm, to make jailbreak inputs to VLMs stealthier.
  • Proposal of a black-box auditing mechanism that evaluates safety using the internal reasoning traces of VLMs, moving beyond traditional keyword filters.
  • Demonstration, through extensive experiments, of significantly higher attack success rates (ASR) than baseline methods, showing that prompt rewriting and adaptive noising are each effective.
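The adaptive noising side of the dual strategy can be pictured as a ReAct-style observe–think–act loop: perturb the image, observe the model's feedback, and keep only perturbations that weaken the safety response. The sketch below is a minimal, self-contained illustration of that loop under stated assumptions, not the paper's implementation: `query_vlm` and `refusal_score` are hypothetical stubs standing in for a real model call and the paper's feedback signal, and the acceptance rule is a simple greedy heuristic.

```python
import numpy as np

def refusal_score(response: str) -> float:
    """Toy stand-in for the feedback signal: 1.0 if the model
    refused, 0.0 if the safety response no longer triggers."""
    return 1.0 if "cannot" in response.lower() else 0.0

def query_vlm(image: np.ndarray, prompt: str) -> str:
    """Hypothetical stub for a VLM call (replace with a real model).
    Uses a dummy rule so this self-contained demo terminates:
    larger mean deviation from the clean image simulates evasion."""
    return "cannot comply" if np.abs(image - 0.5).mean() < 0.01 else "ok"

def adaptive_noising(image, prompt, steps=20, eps=0.05, seed=0):
    """ReAct-style loop: act (add bounded noise), observe (query the
    model), reason (keep the perturbation only if feedback improves)."""
    rng = np.random.default_rng(seed)
    best = image.copy()
    best_score = refusal_score(query_vlm(best, prompt))
    for _ in range(steps):
        if best_score == 0.0:          # safety response no longer fires
            break
        # Act: small bounded perturbation, eps keeps the image natural.
        cand = np.clip(best + rng.uniform(-eps, eps, best.shape), 0.0, 1.0)
        # Observe + reason: accept only if the refusal signal drops.
        score = refusal_score(query_vlm(cand, prompt))
        if score <= best_score:
            best, best_score = cand, score
    return best, best_score

clean = np.full((8, 8), 0.5)           # toy grayscale "image"
noised, score = adaptive_noising(clean, "describe this image")
```

In the paper's setting, the stub scorer would be replaced by feedback derived from the target VLM's actual responses, and the perturbations would be concentrated in the image regions the ReAct reasoning identifies as most likely to activate safety defenses.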

💡 Why This Paper Matters

This paper exposes concrete vulnerabilities in vision-language models and introduces practical methods to exploit them. The dual-strategy approach advances our understanding of adversarial attacks on multimodal AI systems and underscores the importance of robust safety mechanisms when deploying VLMs.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is particularly relevant because it both characterizes the vulnerabilities of vision-language models and provides a practical framework for assessing them. The findings underscore the need for stronger defenses against such adversarial strategies, making this an important reference for ongoing work in AI safety and security.

📚 Read the Full Paper