
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Authors: Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang

Published: 2025-09-29

arXiv ID: 2509.24296v1

Added to Library: 2025-09-30 04:01 UTC

Red Teaming

📄 Abstract

The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
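
To ground the first stage of the defense: the abstract describes Stochastic Annealing Remasking as injecting controlled randomness into the otherwise greedy remasking step of dLLM decoding and annealing that randomness away as denoising proceeds. The sketch below shows one plausible form such a step could take; the function name, signature, and the Gumbel-top-k sampling choice are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a stochastic-annealing remasking step for a dLLM
# denoiser. All names and the Gumbel-top-k formulation are illustrative,
# not taken from the DiffuGuard codebase.
import torch

def select_tokens_to_unmask(confidences: torch.Tensor,
                            num_to_unmask: int,
                            step: int,
                            total_steps: int,
                            tau_start: float = 1.0) -> torch.Tensor:
    """Pick which still-masked positions to commit at this denoising step.

    confidences: (num_masked,) model confidence for its predicted token at
        each still-masked position.
    Returns indices (into the masked positions) of tokens to unmask.
    """
    # Anneal the temperature: early steps are stochastic, later steps
    # approach the standard greedy (highest-confidence-first) selection.
    tau = tau_start * (1.0 - step / total_steps)

    if tau <= 1e-6:
        # Fully annealed: fall back to plain greedy top-k selection.
        return torch.topk(confidences, k=num_to_unmask).indices

    # Gumbel-top-k trick: taking the top-k of Gumbel-perturbed scores is
    # equivalent to sampling k positions without replacement in proportion
    # to softmax(confidences / tau).
    gumbel = -torch.log(-torch.log(torch.rand_like(confidences)))
    perturbed = confidences / tau + gumbel
    return torch.topk(perturbed, k=num_to_unmask).indices
```

Under this reading, the early denoising steps, where the paper's Denoising-path Dependence finding places the decisive safety choices, can explore beyond the single highest-confidence path, while later steps recover the efficiency of greedy selection.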

🔍 Key Points

  • The paper identifies unique vulnerabilities in Diffusion Large Language Models (dLLMs) through an in-depth analysis of intra-step and inter-step safety dynamics.
  • The authors introduce the concept of Denoising-path Dependence, demonstrating that early-stage token choices significantly impact final outputs and overall model safety.
  • The proposed DiffuGuard framework offers a novel, training-free dual-stage defense mechanism that enhances safety in dLLMs while preserving model performance (see the audit sketch after this list).
  • Experimental results show DiffuGuard reduces the Attack Success Rate (ASR) from 47.9% to 14.7% across six diverse jailbreak methods on four dLLMs, highlighting the framework's robustness.
  • The paper provides extensive empirical evaluations, demonstrating the practical viability of its proposed safety measures.
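
The second stage, Block-level Audit and Repair, is described as using the model's internal representations to detect risk in a generated block and then guide a correction. Below is a minimal, self-contained sketch of the audit half, assuming a simple linear probe over pooled hidden states; the probe direction, pooling choice, and threshold are assumptions for illustration, not the paper's method, and the repair half (remasking the flagged block and re-denoising it under guidance) is left as described in the abstract.

```python
# Hypothetical block-audit step: score a generated block for risk from the
# dLLM's hidden states via a linear probe. The probe weights and threshold
# are placeholders; the paper's actual detector may differ.
import torch

def block_risk_score(block_hidden: torch.Tensor,
                     probe_weight: torch.Tensor,
                     probe_bias: float = 0.0) -> float:
    """block_hidden: (block_len, d_model) hidden states of the block's tokens.
    probe_weight: (d_model,) direction separating safe from harmful activations,
    e.g. fitted offline on contrastive safe/harmful prompts."""
    pooled = block_hidden.mean(dim=0)  # average over the block's positions
    return torch.sigmoid(pooled @ probe_weight + probe_bias).item()

def needs_repair(block_hidden: torch.Tensor,
                 probe_weight: torch.Tensor,
                 threshold: float = 0.5) -> bool:
    """True if the block should be remasked and regenerated under guidance."""
    return block_risk_score(block_hidden, probe_weight) >= threshold

# Toy usage with random tensors, just to show the expected shapes.
if __name__ == "__main__":
    d_model, block_len = 4096, 32
    hidden = torch.randn(block_len, d_model)
    probe = torch.randn(d_model) / d_model ** 0.5
    print(needs_repair(hidden, probe))
```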

💡 Why This Paper Matters

This paper matters because it systematically addresses the safety risks that emerge from the distinctive generation mechanism of diffusion language models, proposes a practical, training-free defense, and validates its efficacy through rigorous experimentation. As dLLMs continue to gain popularity, ensuring their safe deployment is paramount to preventing misuse and harmful outputs.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper valuable because it not only sheds light on vulnerabilities specific to a new generation of LLMs but also introduces a practical defensive strategy. Understanding these mechanisms is essential for building safer AI systems and for protecting against adversarial exploitation, making this research a significant contribution to the field of AI safety.

📚 Read the Full Paper: https://arxiv.org/abs/2509.24296v1