
GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Authors: Javad Forough, Mohammad Maheri, Hamed Haddadi

Published: 2025-09-27

arXiv ID: 2509.23037v1

Added to Library: 2025-09-30 04:04 UTC

Red Teaming

πŸ“„ Abstract

Large Language Models (LLMs) are increasingly susceptible to jailbreak attacks, which are adversarial prompts that bypass alignment constraints and induce unauthorized or harmful behaviors. These vulnerabilities undermine the safety, reliability, and trustworthiness of LLM outputs, posing critical risks in domains such as healthcare, finance, and legal compliance. In this paper, we propose GuardNet, a hierarchical filtering framework that detects and filters jailbreak prompts prior to inference. GuardNet constructs structured graphs that combine sequential links, syntactic dependencies, and attention-derived token relations to capture both linguistic structure and contextual patterns indicative of jailbreak behavior. It then applies graph neural networks at two levels: (i) a prompt-level filter that detects global adversarial prompts, and (ii) a token-level filter that pinpoints fine-grained adversarial spans. Extensive experiments across three datasets and multiple attack settings show that GuardNet substantially outperforms prior defenses. It raises prompt-level F1 scores from 66.4% to 99.8% on LLM-Fuzzer, and from 67-79% to over 94% on PLeak datasets. At the token level, GuardNet improves F1 from 48-75% to 74-91%, with IoU gains up to +28%. Despite its structural complexity, GuardNet maintains acceptable latency and generalizes well in cross-domain evaluations, making it a practical and robust defense against jailbreak threats in real-world LLM deployments.
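
To make the graph construction described above concrete, here is a minimal sketch assuming the three edge sources are merged into a single weighted adjacency matrix; the function name, the undirected edge treatment, and the attention threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal illustrative sketch (not the authors' code) of a hybrid token graph
# that mixes sequential, syntactic, and attention-derived edges.
import numpy as np

def build_hybrid_adjacency(n_tokens, dep_edges, attn, attn_threshold=0.1):
    """Return an (n_tokens x n_tokens) adjacency combining three edge sources.

    dep_edges: list of (head_idx, child_idx) pairs from a dependency parser
    attn:      (n_tokens, n_tokens) attention matrix, e.g. averaged over heads
    """
    adj = np.zeros((n_tokens, n_tokens), dtype=np.float32)

    # 1) Sequential links: connect each token to its immediate neighbour.
    for i in range(n_tokens - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0

    # 2) Syntactic dependencies: connect head and child tokens (undirected here).
    for head, child in dep_edges:
        adj[head, child] = adj[child, head] = 1.0

    # 3) Attention-derived relations: keep only sufficiently strong weights.
    adj = np.maximum(adj, np.where(attn >= attn_threshold, attn, 0.0))
    return adj

# Toy usage with 5 tokens, a made-up parse, and a random stand-in attention map.
toy_attn = np.random.dirichlet(np.ones(5), size=5).astype(np.float32)
toy_adj = build_hybrid_adjacency(5, dep_edges=[(1, 0), (1, 3), (3, 4)], attn=toy_attn)
```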

πŸ” Key Points

  • Introduces GuardNet, a hierarchical filtering framework that uses graph neural networks for jailbreak detection in large language models (LLMs).
  • Implements a hybrid graph construction that integrates sequential, syntactic, and attention-derived token relations to strengthen adversarial detection.
  • Demonstrates clear gains over existing defenses, reaching prompt-level F1 scores of up to 99.8% and token-level improvements with IoU gains of up to +28%.
  • Provides a two-stage detection pipeline that filters adversarial prompts and localizes harmful token spans without modifying the underlying LLM architecture (see the sketch after this list).
  • Presents thorough experimental evaluations across diverse datasets and attack settings, confirming GuardNet’s cross-domain generalizability and robustness.
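
As referenced in the fourth key point, here is a rough sketch of how such a two-stage filter could be wired together, assuming a cascade in which the prompt-level decision gates the token-level pass; all names in the snippet are hypothetical stand-ins rather than GuardNet's actual interface.

```python
# Hedged sketch of the two-stage flow: a prompt-level filter gates the prompt,
# and only flagged prompts are passed to the token-level filter for span
# localization. FilterDecision, is_adversarial, and flag_spans are
# hypothetical names, not GuardNet's actual interface.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FilterDecision:
    blocked: bool                                   # True if the prompt is held back
    flagged_spans: List[Tuple[int, int]] = field(default_factory=list)  # (start, end) token indices

def guard_prompt(prompt: str, prompt_filter, token_filter) -> FilterDecision:
    """Stage 1: global prompt-level check; Stage 2: fine-grained span flagging."""
    if not prompt_filter.is_adversarial(prompt):    # benign prompts pass straight through
        return FilterDecision(blocked=False)
    spans = token_filter.flag_spans(prompt)         # localize the adversarial tokens
    return FilterDecision(blocked=True, flagged_spans=spans)
```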

πŸ’‘ Why This Paper Matters

This paper matters because it addresses the growing vulnerability of LLMs to jailbreak attacks with a holistic framework that filters unsafe inputs before inference. GuardNet not only identifies and mitigates adversarial prompts effectively but also maintains acceptable latency and strong cross-domain performance, making it a practical candidate for real-world deployments in critical sectors such as healthcare and finance.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it provides insight into emerging threats against LLMs and introduces a defense mechanism built on graph-attention filtering. The comprehensive analysis and experiments offer benchmarks and methodologies that can inform future research in adversarial machine learning and prompt-based attack prevention.

πŸ“š Read the Full Paper