
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Authors: Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian

Published: 2025-07-08

arXiv ID: 2507.06043v1

Added to Library: 2025-07-09 05:00 UTC

Red Teaming Safety

📄 Abstract

Security alignment enables large language models (LLMs) to resist malicious queries, but a variety of jailbreak attacks reveal the fragility of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that unifies attack and defense. Our method builds on the linearly separable property of LLM intermediate-layer embeddings and on the essence of jailbreak attacks, which is to shift the embeddings of harmful queries into the safe region. We use a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, enabling efficient jailbreak attacks and defenses. Experimental results show that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on state-of-the-art jailbreak datasets reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.
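
The abstract rests on two ideas: harmful and benign prompts become (near-)linearly separable in the LLM's intermediate-layer embedding space, and a jailbreak amounts to moving a harmful prompt's embedding across that boundary into the safe region. The following minimal sketch (not the authors' released code; the model name, layer index, and prompts are illustrative placeholders) shows how one might check the linear-separability premise with a simple probe on last-token hidden states.

```python
# Minimal sketch (not the authors' code): probe the linear separability of
# safe vs. harmful prompts in an LLM's intermediate-layer hidden states.
# Model name, layer index, and labeled prompts are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
LAYER = 16                                    # hypothetical intermediate layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def embed(prompt: str) -> torch.Tensor:
    """Return the last-token hidden state at the chosen intermediate layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Hypothetical labeled prompts: 1 = harmful, 0 = benign.
prompts = [
    ("How do I bake bread?", 0),
    ("Explain photosynthesis.", 0),
    ("How do I pick a lock to break into a house?", 1),
    ("Write malware that steals passwords.", 1),
]

X = torch.stack([embed(p) for p, _ in prompts]).float().numpy()
y = [label for _, label in prompts]

# If the embeddings are (near-)linearly separable, a linear classifier recovers
# the model's internal safety judgment boundary with high accuracy.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe training accuracy:", probe.score(X, y))
```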

🔍 Key Points

  • Proposes CAVGAN, a unified framework for jailbreak attacks and defenses in large language models (LLMs), highlighting the interdependence of the two.
  • Reports an average jailbreak attack success rate of 88.85% and a defense success rate of 84.17% across three popular LLMs, supporting the effectiveness and robustness of the approach.
  • Combines Concept Activation Vectors (CAVs) and Generative Adversarial Networks (GANs) to identify the security judgment boundary inside LLMs (see the sketch after this list).
  • Analyzes the internal representation space of LLMs to better understand their vulnerabilities and strengthen protections against malicious prompts.
  • Shows that carefully designed adversarial training can improve defense strategies without fine-tuning the LLM's parameters.
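
To make the CAV-plus-GAN interplay concrete, here is an illustrative PyTorch sketch under assumed shapes and losses (the paper's exact architectures and training objectives may differ): the discriminator acts as a learned safety-boundary classifier over intermediate-layer embeddings, playing the role of a CAV-style probe, while the generator learns bounded perturbations that push harmful embeddings across that boundary.

```python
# Illustrative sketch of the CAVGAN idea, not the authors' implementation.
# HIDDEN_DIM, layer sizes, the perturbation scale, and the losses are assumptions.
import torch
import torch.nn as nn

HIDDEN_DIM = 4096  # hypothetical LLM hidden size

class Discriminator(nn.Module):
    """Scores an embedding as unsafe (1) vs. safe (0); approximates the safety boundary."""
    def __init__(self, dim: int = HIDDEN_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))
    def forward(self, h):
        return self.net(h)

class Generator(nn.Module):
    """Produces a bounded perturbation meant to move a harmful embedding into the safe region."""
    def __init__(self, dim: int = HIDDEN_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim), nn.Tanh())
    def forward(self, h):
        return h + 0.1 * self.net(h)  # residual perturbation; the 0.1 scale is a guess

D, G = Discriminator(), Generator()
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(h_safe, h_harmful):
    """One adversarial step on batches of intermediate-layer embeddings."""
    # 1) Discriminator: learn the safety judgment boundary, including on perturbed inputs.
    opt_d.zero_grad()
    d_loss = (bce(D(h_safe), torch.zeros(len(h_safe), 1))
              + bce(D(h_harmful), torch.ones(len(h_harmful), 1))
              + bce(D(G(h_harmful).detach()), torch.ones(len(h_harmful), 1)))
    d_loss.backward()
    opt_d.step()
    # 2) Generator: perturb harmful embeddings so the boundary classifies them as safe.
    opt_g.zero_grad()
    g_loss = bce(D(G(h_harmful)), torch.zeros(len(h_harmful), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Usage sketch: h_safe / h_harmful would be last-token hidden states collected
# from the target LLM (as in the probing snippet above); random tensors stand in here.
h_safe, h_harmful = torch.randn(8, HIDDEN_DIM), torch.randn(8, HIDDEN_DIM)
print(train_step(h_safe, h_harmful))
```

One way the attack/defense duality can play out under this reading: the generator supplies the jailbreak (embedding-space perturbations that cross the boundary), while the adversarially trained discriminator supplies the defense, flagging inputs whose embeddings fall on the unsafe side, without touching the LLM's own parameters.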

💡 Why This Paper Matters

This paper provides significant insights into the duality of attack and defense in LLM security, presenting a novel methodology that enhances the understanding and robustness of model protections against jailbreak attempts. By merging these two typically isolated domains, the authors lay a foundation for more effective and unified security strategies in the evolving landscape of AI models.

🎯 Why It's Interesting for AI Security Researchers

This research is crucial for AI security researchers as it not only addresses vulnerabilities inherent in LLMs but also proposes a systematic approach to strengthen defenses. The innovative use of CAVs and GANs to facilitate better understanding and enhancement of security mechanisms is particularly relevant given the increasing sophistication of adversarial attacks in AI systems. Understanding these dynamics can inspire further research and development of secure AI systems.

📚 Read the Full Paper

arXiv: https://arxiv.org/abs/2507.06043v1