
Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

Authors: Richard J. Young

Published: 2025-11-27

arXiv ID: 2511.22047v1

Added to Library: 2025-12-01 03:00 UTC

Red Teaming

📄 Abstract

Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation, yet their robustness against sophisticated adversarial attacks remains poorly characterized. This study evaluated ten publicly available guardrail models from Meta, Google, IBM, NVIDIA, Alibaba, and Allen AI across 1,445 test prompts spanning 21 attack categories. While Qwen3Guard-8B achieved the highest overall accuracy (85.3%, 95% CI: 83.4-87.1%), a critical finding emerged when separating public benchmark prompts from novel attacks: all models showed substantial performance degradation on unseen prompts, with Qwen3Guard dropping from 91.0% to 33.8% (a 57.2 percentage point gap). In contrast, Granite-Guardian-3.2-5B showed the best generalization with only a 6.5 percentage point gap. A "helpful mode" jailbreak was also discovered where two guardrail models (Nemotron-Safety-8B, Granite-Guardian-3.2-5B) generated harmful content instead of blocking it, representing a novel failure mode. These findings suggest that benchmark performance may be misleading due to training data contamination, and that generalization ability, not overall accuracy, should be the primary metric for guardrail evaluation.
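As a rough illustration of the metrics quoted above, the minimal sketch below computes a guardrail's accuracy with a 95% normal-approximation confidence interval and the benchmark-versus-novel generalization gap in percentage points. The `Result` record format and the split labels are illustrative assumptions made for this sketch, not the paper's actual data schema or evaluation code.

```python
# Minimal sketch of the metrics reported in the abstract: accuracy with a
# 95% confidence interval, plus the benchmark-vs-novel generalization gap.
# The record format and split labels are illustrative assumptions.
from dataclasses import dataclass
from math import sqrt


@dataclass
class Result:
    split: str      # "benchmark" (public prompts) or "novel" (unseen attacks)
    correct: bool   # did the guardrail classify this prompt correctly?


def accuracy_with_ci(results: list[Result]) -> tuple[float, float, float]:
    """Accuracy with a 95% normal-approximation (Wald) confidence interval."""
    n = len(results)
    p = sum(r.correct for r in results) / n
    half_width = 1.96 * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def generalization_gap(results: list[Result]) -> float:
    """Benchmark accuracy minus novel-prompt accuracy, in percentage points."""
    bench = [r for r in results if r.split == "benchmark"]
    novel = [r for r in results if r.split == "novel"]
    return 100 * (accuracy_with_ci(bench)[0] - accuracy_with_ci(novel)[0])
```

Under this definition, the abstract's reported 91.0% benchmark accuracy and 33.8% novel-prompt accuracy for Qwen3Guard-8B correspond to the quoted 57.2 percentage point gap.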

🔍 Key Points

  • Comprehensive evaluation of 10 large language model safety guardrails across 1,445 prompts spanning 21 adversarial attack categories, highlighting critical weaknesses in unseen prompt performance.
  • Significant performance degradation observed in guardrail models when tested on novel prompts, with the top-performing Qwen3Guard-8B dropping by 57.2 percentage points (91.0% on public benchmark prompts vs. 33.8% on novel attacks), suggesting potential training data contamination or overfitting to benchmarks.
  • Discovery of the 'helpful mode' jailbreak, wherein guardrail models inadvertently generated harmful content instead of blocking it, transforming safety mechanisms into potential attack vectors.
  • Critique of current evaluation practices in AI safety, advocating that generalization ability, not overall accuracy alone, be treated as the primary metric so that reported results better reflect real-world applicability (a small illustrative sketch follows this list).
  • Analysis of model size effects showing that larger models do not necessarily perform better at safety classification, challenging conventional assumptions about model scaling.
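To make the ranking argument concrete, the small sketch below orders models by overall accuracy versus by generalization gap. Only the Qwen3Guard-8B figures come from the abstract; the Granite-Guardian-3.2-5B accuracies are placeholder values chosen solely to be consistent with its reported 6.5 percentage point gap, and the code is illustrative rather than the paper's methodology.

```python
# Illustrative ranking by overall accuracy vs. by generalization gap.
# Only the Qwen3Guard-8B numbers come from the abstract; the
# Granite-Guardian-3.2-5B accuracies are placeholders consistent with
# its reported 6.5 percentage point gap.
models = {
    # name: (overall_acc, benchmark_acc, novel_acc) in percent
    "Qwen3Guard-8B": (85.3, 91.0, 33.8),            # figures from the abstract
    "Granite-Guardian-3.2-5B": (75.0, 78.0, 71.5),  # placeholder accuracies (gap = 6.5 pp)
}

ranked_by_accuracy = sorted(models, key=lambda m: models[m][0], reverse=True)
ranked_by_gap = sorted(models, key=lambda m: models[m][1] - models[m][2])

print("By overall accuracy:", ranked_by_accuracy)   # Qwen3Guard-8B first
print("By generalization gap:", ranked_by_gap)      # Granite-Guardian-3.2-5B first
```

The two orderings disagree: the model with the highest overall accuracy is not the one that generalizes best, which is the paper's central point about how guardrails should be evaluated.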

💡 Why This Paper Matters

This paper matters because it examines the reliability and robustness of safety guardrail models in the era of large language models and highlights the need for evaluation techniques that accurately reflect their behavior under sophisticated adversarial attacks. It underscores that high accuracy on known benchmarks is not enough: these models must also handle unseen, real-world attacks effectively.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it exposes critical vulnerabilities in the safety guardrails used to mitigate harmful content generation in AI applications. The empirical findings on performance degradation with novel prompts, together with the "helpful mode" jailbreak, provide concrete insight into attack vectors that future defenses against adversarial manipulation of language models must account for.

📚 Read the Full Paper