AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

Authors: Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua

Published: 2025-09-21

arXiv ID: 2509.16861v1

Added to Library: 2025-09-23 04:02 UTC

Red Teaming Safety

📄 Abstract

Guardrails are critical for the safe deployment of Large Language Model (LLM)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions--opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply--to as low as 12%--when confronted with unseen attacks. This highlights a growing software engineering challenge: how to build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Through empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies post-deployment. We release AdaptiveGuard and the studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
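
The abstract frames unseen jailbreaks as out-of-distribution (OOD) inputs to the guardrail classifier. As a minimal sketch of how such a detector might score prompts, the snippet below computes a free-energy score over a safe/unsafe classifier's logits and flags low-confidence inputs as OOD. The energy score, the threshold value, and the stand-in logits are illustrative assumptions, not the scoring function used in the paper.

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Free-energy OOD score: higher (less negative) means the classifier is
    less confident, i.e. the prompt looks less like its training distribution."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Stand-in logits over {safe, unsafe} from a hypothetical guardrail classifier.
logits = torch.tensor([
    [4.2, -3.1],   # confident prediction -> likely in-distribution
    [0.3,  0.1],   # near-uniform logits  -> possibly a novel jailbreak
])

threshold = -1.0  # illustrative; in practice calibrated on held-out in-distribution prompts
print(energy_score(logits) > threshold)  # tensor([False,  True])
```

Prompts flagged this way would then feed the continual learning update summarized in the key points below.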

🔍 Key Points

  • Introduces AdaptiveGuard, a guardrail framework that adapts to emerging jailbreak attacks after deployment by detecting them as out-of-distribution (OOD) inputs.
  • Achieves 96% OOD detection accuracy and retains over 85% F1-score on in-distribution data after adaptation, outperforming existing guardrails such as LlamaGuard.
  • Uses a continual learning approach with Low-Rank Adaptation (LoRA) to adapt to novel attacks with minimal retraining, reaching its best defense success rates in just 2-38 update steps (see the sketch after this list).
  • Provides empirical evidence that AdaptiveGuard balances defense against new OOD inputs with retention of previously learned in-distribution behavior, mitigating catastrophic forgetting.
  • Releases the AdaptiveGuard model and the studied datasets to support further research on adaptive guardrail technologies.
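
The LoRA-based continual learning step named above can be pictured with the standard PEFT workflow: freeze the base guardrail classifier, attach low-rank adapters, and take a few gradient steps on prompts the OOD detector has flagged as a new attack. This is a sketch under stated assumptions; the base model (distilbert-base-uncased), target modules, hyperparameters, and placeholder prompt are illustrative, not the paper's actual training recipe.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder base classifier; the real guardrail would be a fine-tuned safety model.
base = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Attach low-rank adapters so only a small fraction of weights is updated.
lora_cfg = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_lin", "v_lin"])  # DistilBERT attention projections
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# Prompts the OOD detector flagged at runtime, labelled unsafe (1).
flagged_prompts = ["<jailbreak prompt flagged by the OOD detector>"]
labels = torch.tensor([1])

optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()
for _ in range(2):  # the summary reports adaptation within 2-38 update steps
    batch = tokenizer(flagged_prompts, return_tensors="pt", padding=True, truncation=True)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because only the adapter weights (plus the classification head in this sketch) are trained while the base model stays frozen, most behavior on previously seen in-distribution prompts is preserved, which is one common way to limit catastrophic forgetting.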

💡 Why This Paper Matters

This paper highlights how poorly static guardrails hold up when LLM-powered applications face previously unseen attacks. AdaptiveGuard offers a guardrail that keeps adapting after deployment, which matters in real-world settings where jailbreak strategies change constantly. Its high OOD detection accuracy and small loss of in-distribution performance after adaptation make the approach practically relevant to AI safety and security.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it tackles jailbreak attacks on large language models (LLMs), a concern that grows as these models see wider deployment. AdaptiveGuard advances adaptive guardrails by combining OOD detection with continual learning, strengthening the safety layer itself rather than only the underlying model. Its techniques and findings can inform future work on model robustness and runtime security protocols.

📚 Read the Full Paper: https://arxiv.org/abs/2509.16861v1