← Back to Library

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Authors: Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang

Published: 2025-07-15

arXiv ID: 2507.11097v1

Added to Library: 2025-07-16 04:00 UTC

Red Teaming Safety

📄 Abstract

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

🔍 Key Points

  • Identification of a novel safety vulnerability in diffusion-based large language models (dLLMs), specifically related to the inadequacy of current alignment mechanisms against masked-input adversarial prompts.
  • Introduction of DIJA (Diffusion-based LLMs Jailbreak Attack), a systematic framework that exploits the bidirectional modeling and parallel decoding features of dLLMs to bypass existing safety defenses.
  • Comprehensive experimental results demonstrating the effectiveness of DIJA, achieving significantly higher attack success rates compared to existing jailbreak methods and showcasing the urgency for improved safety alignment strategies.
  • Highlighting the potential for harmful outputs generated by dLLMs due to their unique architecture, which contrasts with traditional autoregressive models in terms of dynamic filtering capabilities.
  • Empirical analysis revealing the robustness of DIJA against various defense mechanisms, indicating potential weaknesses in the current safety protocols for dLLMs.

💡 Why This Paper Matters

This paper is relevant as it exposes critical vulnerabilities in emerging diffusion-based large language models and proposes a novel attack framework that successfully circumvents existing safety measures. The findings emphasize the need for re-evaluating and enhancing safety alignment techniques to mitigate the risks associated with dLLMs, which are increasingly being adopted in various applications. By addressing these vulnerabilities, the research contributes to our understanding of LLM safety and the necessity for more robust defenses.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of particular interest to AI security researchers as it uncovers previously unexplored safety vulnerabilities within a new class of language models. The novel methods and frameworks introduced, alongside the empirical results, provide essential insights into the limitations of current safety protocols. Understanding these threats is crucial for developing advanced safety measures, ensuring responsible AI deployment, and informing future research on enhancing the robustness of language models against adversarial attacks.

📚 Read the Full Paper