
Diffusion LLMs are Natural Adversaries for any LLM

Authors: David Lüdke, Tom Wollschläger, Paul Ungermann, Stephan Günnemann, Leo Schwinn

Published: 2025-10-31

arXiv ID: 2511.00203v1

Added to Library: 2025-11-05 05:02 UTC

Red Teaming

📄 Abstract

We introduce a novel framework that transforms the resource-intensive (adversarial) prompt optimization problem into an efficient, amortized inference task. Our core insight is that pretrained, non-autoregressive generative LLMs, such as Diffusion LLMs, which model the joint distribution over prompt-response pairs, can serve as powerful surrogates for prompt search. This approach enables the direct conditional generation of prompts, effectively replacing costly, per-instance discrete optimization with a small number of parallelizable samples. We provide a probabilistic analysis demonstrating that under mild fidelity assumptions, only a few conditional samples are required to recover high-reward (harmful) prompts. Empirically, we find that the generated prompts are low-perplexity, diverse jailbreaks that exhibit strong transferability to a wide range of black-box target models, including robustly trained and proprietary LLMs. Beyond adversarial prompting, our framework opens new directions for red teaming, automated prompt optimization, and leveraging emerging Flow- and Diffusion-based LLMs.
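
The amortized search loop described in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the authors' code: `inpaint_prompt` stands in for a diffusion-LLM sampler that fills masked prompt tokens conditioned on a fixed target response, and `reward` for a scoring function such as a jailbreak judge; both are placeholder interfaces, not a real library API.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not the paper's code or any specific library API):
#   InpaintFn: samples a candidate prompt conditioned on a target response,
#              i.e. the diffusion-LLM "inpainting" step from the abstract.
#   RewardFn:  scores a (prompt, response) pair, e.g. with a jailbreak judge.
InpaintFn = Callable[[str], str]
RewardFn = Callable[[str, str], float]


def amortized_prompt_search(
    inpaint_prompt: InpaintFn,
    reward: RewardFn,
    target_response: str,
    num_samples: int = 8,
) -> Tuple[str, float]:
    """Replace per-instance discrete optimization with a handful of
    parallelizable conditional samples, keeping the highest-reward prompt."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(num_samples):
        prompt = inpaint_prompt(target_response)
        candidates.append((prompt, reward(prompt, target_response)))
    return max(candidates, key=lambda pair: pair[1])
```

Because the samples are drawn independently, the loop can be batched or run in parallel, which is where the amortization over costly per-prompt optimization comes from.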

🔍 Key Points

  • Introduces an inpainting-based framework that leverages diffusion LLMs to generate adversarial prompts far more efficiently than traditional per-instance optimization methods.
  • Demonstrates that pretrained non-autoregressive generative models such as Diffusion LLMs (DLLMs), which model the joint distribution over prompt-response pairs, enable direct conditional sampling of prompts.
  • Provides empirical evidence that the generated prompts transfer across a wide range of target LLMs, including proprietary models, with improved attack success rates (ASR).
  • Presents a probabilistic analysis establishing that a small number of samples from the surrogate model suffices to recover high-reward prompts (see the sketch after this list).
  • Opens new avenues for red teaming and automated prompt optimization, while highlighting the ethical considerations and broader impacts of adversarial prompting.
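
As a rough illustration of the sampling argument in the fourth point (an assumed generic form, not the paper's exact statement): if each independent conditional sample from the surrogate is high-reward with probability p, then at least one of k samples succeeds with probability 1 - (1 - p)^k, so only a few samples are needed even for moderate p.

```python
import math

def samples_needed(p: float, target_prob: float = 0.95) -> int:
    """Smallest k with 1 - (1 - p)**k >= target_prob, assuming independent
    conditional samples that each hit a high-reward prompt with probability p.
    This is a generic bound for illustration, not the paper's exact theorem."""
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - p))

print(samples_needed(0.30))  # 9 samples for a 95% chance of at least one hit
print(samples_needed(0.10))  # 29 samples even when individual hits are rare
```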

💡 Why This Paper Matters

This paper is significant for AI security because it presents a novel, efficient method for generating adversarial prompts with diffusion-based language models. The findings expose how vulnerable current LLMs remain to adversarial attacks and underscore the need for stronger defenses, making the work essential reading for researchers focused on model safety.

🎯 Why It's Interesting for AI Security Researchers

This research is highly relevant to AI security researchers because it tackles the growing problem of adversarial attacks on large language models. Its efficient approach to prompt optimization, together with its analysis of attack effectiveness and efficiency, clarifies the risks of deploying LLMs in real-world applications. The ethical considerations it discusses also highlight the responsibility developers bear for the safety and integrity of AI systems.

📚 Read the Full Paper