RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Authors: Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper

Published: 2025-10-14

arXiv ID: 2510.13901v1

Added to Library: 2025-10-17 04:01 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
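The abstract's joint objective combines three terms: a task loss that encourages the restricted target response, a refusal-aware regularizer that penalizes alignment with a refusal direction in embedding space, and a coherence term that discourages redundant suffix tokens. A minimal sketch of how such an objective might be assembled is below; the loss weights, the use of a single refusal direction vector, and the adjacent-cosine redundancy penalty are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def raid_objective(suffix_emb, refusal_dir, target_loss_fn,
                   lam_refusal=0.5, lam_coh=0.1):
    """Illustrative sketch of a RAID-style joint objective over
    continuous suffix embeddings (weights and exact terms are
    assumptions, not the paper's formulation).

    suffix_emb:  (T, d) array of relaxed suffix token embeddings
    refusal_dir: (d,) direction associated with refusal behavior
    """
    # (i) task term: encourage the restricted target response
    l_target = target_loss_fn(suffix_emb)

    # (ii) refusal-aware regularizer: penalize the component of the
    # mean suffix embedding along the refusal direction
    mean_emb = suffix_emb.mean(axis=0)
    unit_ref = refusal_dir / np.linalg.norm(refusal_dir)
    l_refusal = float(mean_emb @ unit_ref) ** 2

    # (iii) coherence term: discourage redundancy by penalizing high
    # cosine similarity between adjacent suffix embeddings
    normed = suffix_emb / np.linalg.norm(suffix_emb, axis=1, keepdims=True)
    l_coh = float(np.mean(np.sum(normed[:-1] * normed[1:], axis=1)))

    return l_target + lam_refusal * l_refusal + lam_coh * l_coh
```

In the paper's setting this scalar would be minimized by gradient descent on the continuous embeddings before decoding them back to discrete tokens.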

🔍 Key Points

  • Introduction of RAID, a framework that crafts adversarial suffixes for jailbreaking large language models (LLMs) while keeping the generated suffixes coherent and contextually plausible.
  • Implementation of a refusal-aware regularizer that actively steers the LLM away from refusal regions in the embedding space, improving the efficacy of the jailbreak process.
  • Utilization of critic-guided decoding that balances embedding affinity with language-model likelihood, which enhances the naturalness of the output while maintaining adversarial effectiveness.
  • Demonstration of RAID's superiority in terms of attack success rates (ASR) and computational efficiency compared to existing methods like PEZ, COLD-Attack, and GCG across multiple LLMs.
  • Highlighting the significance of embedding-space regularization as a tool for both attacking LLM vulnerabilities and informing future defense strategies.
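The critic-guided decoding step described above maps each optimized continuous embedding back to a discrete token by trading off embedding affinity against language-model likelihood. A toy sketch of one way this projection could work is shown below; the cosine-similarity affinity, the linear interpolation weight `alpha`, and the per-position greedy argmax are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def critic_guided_decode(opt_embs, vocab_embs, lm_logprobs, alpha=0.7):
    """Illustrative sketch of critic-guided decoding: for each optimized
    continuous embedding, score every vocabulary token by a weighted sum
    of embedding affinity (cosine similarity) and language-model
    log-likelihood, then greedily pick the argmax. `alpha` is an
    assumed interpolation weight.

    opt_embs:    (T, d) optimized suffix embeddings
    vocab_embs:  (V, d) token embedding table
    lm_logprobs: (T, V) language-model log-probabilities per position
    """
    v_norm = vocab_embs / np.linalg.norm(vocab_embs, axis=1, keepdims=True)
    tokens = []
    for t, e in enumerate(opt_embs):
        e_norm = e / np.linalg.norm(e)
        affinity = v_norm @ e_norm  # cosine affinity to every vocab token
        score = alpha * affinity + (1 - alpha) * lm_logprobs[t]
        tokens.append(int(np.argmax(score)))
    return tokens
```

Blending in the language-model term is what lets a fluent-but-slightly-less-affine token win over an unnatural nearest neighbor, which is how the suffix stays natural in form while remaining adversarially effective.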

💡 Why This Paper Matters

This paper is important because it provides a novel approach to analyzing and exploiting weaknesses in language models, which are increasingly deployed across sensitive applications. By developing the RAID technique, the researchers not only underscore vulnerabilities associated with LLMs but also pave the way for enhancing safety mechanisms against such jailbreak attacks.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers as it tackles the pressing issue of jailbreak vulnerabilities in large language models. Understanding how to exploit these weaknesses highlights critical areas for developing robust defenses and improving the resilience of AI systems against adversarial manipulation. Moreover, the insights derived from embedding-space regularization and refusal-aware mechanisms could significantly contribute to the design of safer AI frameworks.
