
KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

Authors: Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Zhaoxia Yin

Published: 2025-11-09

arXiv ID: 2511.07480v1

Added to Library: 2025-11-14 23:03 UTC

Red Teaming

Abstract

With the widespread application of large language models (LLMs) in various fields, the security challenges they face have become increasingly prominent, especially the issue of jailbreak. These attacks induce the model to generate erroneous or uncontrolled outputs through crafted inputs, threatening the generality and security of the model. Although existing defense methods have shown some effectiveness, they often struggle to strike a balance between model generality and security. Excessive defense may limit the normal use of the model, while insufficient defense may lead to security vulnerabilities. In response to this problem, we propose a Knowledge Graph Defense Framework (KG-DF). Specifically, because of its structured knowledge representation and semantic association capabilities, Knowledge Graph (KG) can be searched by associating input content with safe knowledge in the knowledge base, thus identifying potentially harmful intentions and providing safe reasoning paths. However, traditional KG methods encounter significant challenges in keyword extraction, particularly when confronted with diverse and evolving attack strategies. To address this issue, we introduce an extensible semantic parsing module, whose core task is to transform the input query into a set of structured and secure concept representations, thereby enhancing the relevance of the matching process. Experimental results show that our framework enhances defense performance against various jailbreak attack methods, while also improving the response quality of the LLM in general QA scenarios by incorporating domain-general knowledge.
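The flow the abstract describes (parse the query into concepts, match them against a knowledge graph, and hand the retrieved safety knowledge to the model) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the triples in `SAFETY_KG`, the `CONCEPT_LEXICON` lookup standing in for the semantic parsing module, and the helper names `parse_concepts`, `retrieve_paths`, and `build_guarded_prompt` are all hypothetical.

```python
# Toy illustration only: graph contents, lexicon, and function names are
# assumptions for this sketch, not the KG-DF implementation.
import re

# Hypothetical safety knowledge graph as (head, relation, tail) triples.
SAFETY_KG = [
    ("explosive", "is_a", "dangerous_item"),
    ("dangerous_item", "requires", "refusal_of_synthesis_instructions"),
    ("malware", "is_a", "cyber_threat"),
    ("cyber_threat", "requires", "refusal_of_attack_code"),
]

# Hypothetical stand-in for the semantic parsing module:
# maps surface words to canonical concepts in the graph.
CONCEPT_LEXICON = {
    "bomb": "explosive",
    "explosive": "explosive",
    "ransomware": "malware",
    "virus": "malware",
}


def parse_concepts(query):
    """Return the sorted set of known concepts mentioned in the query."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return sorted({CONCEPT_LEXICON[t] for t in tokens if t in CONCEPT_LEXICON})


def retrieve_paths(concept, max_hops=2):
    """Follow outgoing edges from a concept for up to max_hops hops."""
    paths, frontier = [], [concept]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for head, rel, tail in SAFETY_KG:
                if head == node:
                    paths.append((head, rel, tail))
                    next_frontier.append(tail)
        frontier = next_frontier
    return paths


def build_guarded_prompt(query):
    """Prepend retrieved safety triples to the query; pass it through if none match."""
    triples = [t for c in parse_concepts(query) for t in retrieve_paths(c)]
    if not triples:
        return query
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in triples)
    return f"Safety knowledge:\n{facts}\n\nUser query: {query}"
```

Because the defense only rewrites the prompt, a sketch like this needs no access to model weights, which is what makes the approach black-box. Benign queries pass through unchanged here; the paper additionally reports injecting general knowledge for such queries, which this sketch does not model.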

Key Points

  • Introduction of a novel black-box defense framework (KG-DF) that leverages Knowledge Graphs (KG) to enhance security against jailbreak attacks on large language models (LLMs).
  • Development of an extensible semantic parsing module to improve the keyword extraction process, which addresses challenges posed by evolving attack strategies.
  • The framework integrates safety-related and general knowledge into LLM responses, allowing the model to maintain both security and generality during operation.
  • Experimental evaluations demonstrate KG-DF achieves near-zero attack success rates (ASR) while maintaining high generality in both open-source and closed-source LLMs, significantly outperforming existing defense methods.
  • The proposed defense setup also improves the overall response quality of LLMs in general question-and-answer scenarios, indicating practical applicability for real-world usage.
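For readers unfamiliar with the metric, attack success rate is conventionally the fraction of attack prompts that elicit a harmful response. The sketch below assumes each attempt has already been judged and labeled with a boolean; how the paper judges success is not stated in this summary, and the function name `attack_success_rate` is hypothetical.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful.

    `outcomes` is a list of booleans, one per attack prompt, where True
    means the attack was judged to have succeeded (labeling is assumed
    to happen elsewhere).
    """
    if not outcomes:
        raise ValueError("at least one attack attempt is required")
    return sum(outcomes) / len(outcomes)
```

A "near-zero ASR" therefore means that almost none of the labeled outcomes are True under the defense.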

Why This Paper Matters

This paper addresses the critical issue of jailbreak attacks in large language models by proposing an innovative defense framework that utilizes Knowledge Graphs. Its dual focus on enhancing model security while preserving generality is particularly relevant in today's landscape where LLMs are increasingly at risk from adversarial inputs. The findings not only indicate a significant improvement in defense strategies but also contribute to the broader discourse on AI safety and security measures necessary for the deployment of these models in sensitive applications.

Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it offers a comprehensive approach to defending against a significant vulnerability in LLMs: jailbreak attacks. By employing Knowledge Graphs and semantic parsing, the authors present a solution that could inspire further innovations in defensive mechanisms against adversarial attacks. Additionally, the empirical success of the KG-DF framework in enhancing both security performance and response quality presents valuable data and insights for future research at the intersection of AI safety and language model deployment.
