
Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Authors: Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu

Published: 2026-04-07

arXiv ID: 2604.05853v1

Added to Library: 2026-04-08 02:02 UTC

Red Teaming

📄 Abstract

Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignment and underscore the urgent need for typography-aware multimodal defense mechanisms.
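
The abstract describes Etch only at a high level, but the three-layer decomposition is concrete enough to sketch. Below is a minimal Python illustration of how a prompt might be composed from the three functionally orthogonal layers; the EtchPrompt class, its field names, and the example strings are hypothetical assumptions, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class EtchPrompt:
    """Hypothetical container for the three prompt layers named in the abstract."""
    camouflage: str  # semantic camouflage: the visually benign scene description
    anchoring: str   # visual-spatial anchoring: where the text surface sits in the scene
    encoding: str    # typographic encoding: how the payload is rendered as glyphs

    def compose(self) -> str:
        # Because the layers are functionally orthogonal, each can be revised
        # independently, turning one joint search over the full prompt space
        # into three smaller sub-problems.
        return " ".join([self.camouflage, self.anchoring, self.encoding])

# Benign toy example; the strings here are illustrative only.
prompt = EtchPrompt(
    camouflage="A sunlit office desk with scattered stationery.",
    anchoring="A framed certificate hangs centered on the back wall.",
    encoding="The certificate text is printed in crisp serif lettering.",
)
print(prompt.compose())
```

Keeping the layers as separate fields is what makes the later zero-order refinement cheap: a critique that blames one layer triggers a rewrite of that field alone, leaving the other two untouched.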

🔍 Key Points

  • Introduction of the concept of "inscriptive jailbreaks" that utilize text-to-image models to generate harmful text embedded within benign imagery, representing a new attack vector that evades traditional visual filters.
  • Development of the Etch framework, which decomposes adversarial prompts into three layers—semantic camouflage, visual-spatial anchoring, and typographic encoding—to effectively bypass dual-layer safety filters.
  • Empirical evaluation across seven state-of-the-art text-to-image models, demonstrating an average attack success rate (ASR) of 65.57%, with a peak of 91.00%, significantly outperforming existing baselines.
  • Identification of critical vulnerabilities in current text-to-image safety mechanisms, highlighting the need for typography-aware defense strategies against these novel attack types.
  • Introduction of a new evaluation protocol that uses a vision-language model (VLM) as a diagnostic oracle for iterative prompt refinement, showcasing the potential of closed-loop feedback in prompt optimization (see the sketch after this list).
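
The closed-loop protocol in the last bullet can be sketched in the same spirit as the earlier example. The loop below reuses the hypothetical EtchPrompt container from the sketch above; generate_image and critique_image stand in for the black-box T2I model and the VLM critic, and the Critique fields, iteration budget, and stopping rule are assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Critique:
    success: bool   # did the payload render with character-level fidelity?
    layer: str      # layer the critic blames: "camouflage", "anchoring", or "encoding"
    revision: str   # the critic's prescribed rewrite for that layer

def refine(prompt: "EtchPrompt",
           generate_image: Callable[[str], bytes],
           critique_image: Callable[[bytes], Critique],
           max_iters: int = 10) -> Tuple["EtchPrompt", Optional[bytes]]:
    """Zero-order loop: no gradients, only VLM feedback localized to one layer."""
    for _ in range(max_iters):
        image = generate_image(prompt.compose())  # one black-box T2I query
        critique = critique_image(image)          # VLM acts as diagnostic oracle
        if critique.success:
            return prompt, image                  # payload rendered faithfully
        # Apply the targeted revision only to the layer the critic blamed,
        # leaving the other two layers untouched.
        setattr(prompt, critique.layer, critique.revision)
    return prompt, None                           # query budget exhausted
```

Passing the model calls in as callables keeps the sketch self-contained: in the paper's terms, generate_image is a query to the target T2I system and critique_image is the VLM oracle that localizes failures and prescribes revisions.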

💡 Why This Paper Matters

This paper sheds light on a critical and previously overlooked vulnerability in text-to-image models: the risks posed by inscriptive jailbreaks. The methods introduced, particularly the Etch framework, advance the understanding of adversarial attacks on multimodal AI systems and underscore the urgent need for defense mechanisms that account for typographic content. The findings are a wake-up call for researchers and practitioners in AI security, urging the development of robust safety alignment that can mitigate these emerging threats.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting due to its exploration of a novel attack methodology that exploits existing text-to-image model capabilities, thereby revealing gaps in current safety protocols. The insights gained from the Etch framework and its empirical evaluations provide a critical foundation for further research into adversarial resilience, as well as the development of corresponding defense strategies against inscriptive threats. Additionally, the integration of VLMs for diagnostic purposes presents an innovative approach to enhancing the robustness of AI systems, which is a key focus area in the field of AI safety research.
