SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Authors: Zhiyi Mou, Jingyuan Yang, Zeheng Qian, Wangze Ni, Tianfang Xiao, Ning Liu, Chen Zhang, Zhan Qin, Kui Ren

Published: 2026-01-14

arXiv ID: 2601.09321v1

Added to Library: 2026-01-15 03:01 UTC

Red Teaming

📄 Abstract

While Large Language Models (LLMs) have powerful capabilities, they remain vulnerable to jailbreak attacks, a critical barrier to their safe deployment in real-time web applications. Commercial LLM providers deploy output guardrails to filter harmful outputs, yet these defenses are not impenetrable. Because LLMs rely on autoregressive, token-by-token inference, their semantic representations lack robustness to spatially structured perturbations, such as redistributing tokens across different rows, columns, or diagonals. Exploiting this spatial weakness of the Transformer, we propose SpatialJB, which disrupts the model's output generation process so that harmful content bypasses guardrails undetected. Comprehensive experiments on leading LLMs achieve nearly 100% attack success rate (ASR), demonstrating the high effectiveness of SpatialJB. Even with advanced output guardrails such as the OpenAI Moderation API in place, SpatialJB consistently maintains a success rate exceeding 75%, outperforming current jailbreak techniques by a significant margin. SpatialJB exposes a key weakness in current guardrails and underscores the importance of spatial semantics, offering new insights for LLM safety research. To prevent potential misuse, we also present baseline defense strategies against SpatialJB and evaluate their effectiveness in mitigating such attacks. Code for the attack, the baseline defenses, and a demo is available at https://anonymous.4open.science/r/SpatialJailbreak-8E63.
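
To make the core idea concrete, below is a minimal sketch of one spatial layout the abstract mentions: column-wise redistribution of words into a 2D grid. This is an illustration only, not the authors' implementation; the `distribute` and `recover` helpers and the fixed row count are assumptions made for this sketch. A filter that scans the emitted lines left to right sees a permuted word order, while reading the grid column by column restores the original text.

```python
# Illustrative sketch of spatial text redistribution (not the paper's code).
# Words are written into a grid column by column, so the linear token stream
# a sequential filter sees is a permutation of the original text; reading the
# grid column-wise recovers it.
import math


def distribute(text: str, rows: int = 3) -> str:
    """Write words into a grid column by column, then emit it row by row."""
    words = text.split()
    cols = math.ceil(len(words) / rows)
    words += [""] * (rows * cols - len(words))  # pad to a full rectangular grid
    grid = [[words[c * rows + r] for c in range(cols)] for r in range(rows)]
    return "\n".join(" ".join(row).rstrip() for row in grid)


def recover(layout: str) -> str:
    """Read the grid column by column to restore the original word order."""
    grid = [line.split(" ") for line in layout.split("\n")]
    width = max(len(row) for row in grid)
    words = []
    for c in range(width):
        for row in grid:
            if c < len(row) and row[c]:
                words.append(row[c])
    return " ".join(words)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    layout = distribute(original, rows=3)
    print(layout)          # row-by-row reading is scrambled
    assert recover(layout) == original
```

The paper also considers other layouts (e.g., diagonals); the grid shape and reading order above are just one simple instance of the general principle.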

🔍 Key Points

  • Introduction of SpatialJB, a novel jailbreak attack exploiting the spatial weaknesses of Transformer architectures in LLMs by reorganizing textual content into two-dimensional layouts.
  • Demonstration of attack success rates approaching 100% across major LLMs and guardrails, outperforming existing methods by more than 80 percentage points in some cases.
  • Theoretical analyses revealing fundamental flaws in how Transformers process spatially structured inputs, highlighting their inability to maintain semantic coherence across non-sequential layouts.
  • Presentation of baseline defense strategies (SpatialD) aimed at mitigating the risks posed by SpatialJB (a sketch of the underlying idea follows this list), contributing to the discourse on LLM safety measures against such sophisticated attacks.
  • Comprehensive evaluation against various guardrails, demonstrating its ability to bypass modern safety checks designed to flag harmful output.
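
The summary does not spell out how SpatialD works, so the sketch below only illustrates the general flavor of such a defense under stated assumptions: before a standard output guardrail (modeled here as a hypothetical `moderate(text)` callable that returns True when content is flagged) inspects the model output, the defense also checks a column-wise re-reading of the output's line grid, so text hidden by a 2D layout is still exposed to the filter.

```python
# Illustrative baseline defense in the spirit of SpatialD (not the paper's
# implementation). In addition to the literal output, a plausible "spatial
# reading" -- here, the column-wise reading of its line grid -- is also passed
# through the guardrail before the output is released.
from typing import Callable, List


def column_wise_reading(text: str) -> str:
    """Reassemble the text by reading its non-empty lines as a grid, column by column."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        return text  # nothing spatial to undo
    width = max(len(row) for row in rows)
    words = [row[c] for c in range(width) for row in rows if c < len(row)]
    return " ".join(words)


def guarded_output(output: str, moderate: Callable[[str], bool]) -> str:
    """Block the output if either its literal form or its column-wise reading
    trips the guardrail. `moderate` is a hypothetical moderation interface."""
    candidates: List[str] = [output, column_wise_reading(output)]
    if any(moderate(view) for view in candidates):
        return "[blocked by output guardrail]"
    return output
```

A production defense would need to cover more layouts (diagonals, varying grid shapes) and operate at the token rather than word level; this sketch only conveys the idea of re-linearizing spatially distributed text before moderation.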

💡 Why This Paper Matters

This paper is important because it exposes significant vulnerabilities in LLMs, specifically how their architecture can be exploited through spatial-format attacks. Such insights are essential for developing robust safety measures and improving the reliability of AI systems in real-world applications. By highlighting the inadequacies of current guardrails, it not only advances our understanding of adversarial attacks on LLMs but also lays the groundwork for stronger protective strategies against AI misuse.

🎯 Why It's Interesting for AI Security Researchers

This research is of great interest to AI security researchers as it addresses the pressing issue of jailbreak attacks that can lead models to generate harmful content. The introduction of SpatialJB presents a new class of attack that challenges existing safety mechanisms, prompting the need for innovative defenses. As LLMs continue to proliferate in sensitive applications, understanding and mitigating such vulnerabilities will be paramount to ensuring safe and ethical AI usage.

📚 Read the Full Paper