Text is All You Need for Vision-Language Model Jailbreaking

Authors: Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh

Published: 2026-01-31

arXiv ID: 2602.00420v1

Added to Library: 2026-02-03 08:05 UTC

Red Teaming

📄 Abstract

Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model's Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple semantically related but more benign sub-queries. Second, we select a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries positioned in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, so that the model's safety protocols fail to link the scattered sub-queries among a large number of irrelevant queries. Overall, our findings expose a critical vulnerability: LVLMs' OCR capabilities are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses against fragmented multimodal inputs.

🔍 Key Points

  • Introduction of Text-DJ, a novel jailbreak attack for Large Vision-Language Models (LVLMs) that bypasses text-based safety filters using Optical Character Recognition (OCR) capabilities.
  • Three-stage methodology: (1) decomposing harmful queries into benign sub-queries, (2) constructing distraction queries that are unrelated to the harmful content, and (3) presenting them in a grid format to exploit semantic distractions.
  • Extensive experimentation shows that the proposed method consistently outperforms existing techniques (HADES and CS-DJ) in terms of Attack Success Rate (ASR).
  • Ablation studies confirm that the effectiveness of the attack relies on specific design choices such as the arrangement of queries and the use of distraction techniques to confuse the model's safety mechanisms.
  • The findings highlight significant vulnerabilities in the safety alignment of LVLMs when faced with fragmented multimodal inputs.
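The grid-construction stage of the methodology can be sketched in code. The following is a minimal, hypothetical illustration of the layout logic only (the paper does not publish this exact algorithm): sub-queries are placed in the central cells of a square grid and distraction queries fill the remaining cells, mirroring the paper's third stage before each cell would be rendered as an image.

```python
import math

def build_query_grid(sub_queries, distractions):
    """Arrange sub-queries and distraction queries into a square grid,
    placing sub-queries in the central cells (an assumed layout that
    mirrors the paper's described 'middle of the grid' positioning)."""
    total = len(sub_queries) + len(distractions)
    side = math.ceil(math.sqrt(total))
    cells = [None] * (side * side)
    # Order cell indices center-out by Manhattan distance from the midpoint.
    mid = (side - 1) / 2
    order = sorted(range(side * side),
                   key=lambda i: abs(i // side - mid) + abs(i % side - mid))
    # Sub-queries claim the innermost cells; distractions fill outward.
    for idx, q in zip(order, sub_queries):
        cells[idx] = q
    for idx, q in zip(order[len(sub_queries):], distractions):
        cells[idx] = q
    return [cells[r * side:(r + 1) * side] for r in range(side)]

grid = build_query_grid(["sub-1", "sub-2"],
                        ["d1", "d2", "d3", "d4", "d5", "d6", "d7"])
# The first sub-query lands in the exact center of the 3x3 grid.
```

In a full attack pipeline each cell's text would then be rendered to an image and the grid presented to the LVLM in a single multi-image prompt; this sketch covers only the placement step.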

💡 Why This Paper Matters

The introduction of Text-DJ demonstrates a critical new avenue for exploring vulnerabilities in LVLMs, underscoring the need for improved safety protocols that account for OCR and multi-modal attacks. As LVLMs are increasingly deployed in real-world applications, understanding and addressing these vulnerabilities is essential for ensuring secure and safe AI interactions.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant for AI security researchers as it uncovers a novel attack vector against LVLMs, revealing weaknesses that traditional defenses may not cover. By exploring the intersection of OCR capabilities and textual safety measures, this research highlights the necessity for adaptive and holistic security frameworks capable of mitigating such multi-modal adversarial threats.
