
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

Authors: Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang

Published: 2025-11-24

arXiv ID: 2511.19218v1

Added to Library: 2025-11-25 04:01 UTC

Red Teaming Safety

📄 Abstract

Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To mitigate these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum reinforcement learning, enabling robust mutual improvement. Evaluations across multiple benchmarks demonstrate that our method outperforms existing attack and defense approaches, and provides a feasible pathway for developing LLMs that can sustainably support responsible AI ecosystems.
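The abstract names GS-MCTS but does not describe its mechanics here. As a rough illustration of how a Monte Carlo Tree Search over jailbreak strategies can be organized, the Python sketch below runs the four standard MCTS phases (selection, expansion, simulation, backpropagation) over prompt rewrites. Every identifier in it (STRATEGIES, JailbreakNode, rewrite_with_strategy, judge_harmfulness, target_model) is a placeholder assumption for illustration, not the paper's implementation, and the group-aware strategy guidance of GS-MCTS is not reproduced.

```python
import math
import random

# Hypothetical strategy pool; the paper's actual strategy taxonomy is not shown here.
STRATEGIES = ["role_play", "hypothetical_framing", "payload_splitting", "encoding"]


class JailbreakNode:
    """One node in the search tree: a candidate adversarial prompt."""

    def __init__(self, prompt, parent=None, strategy=None):
        self.prompt = prompt
        self.parent = parent
        self.strategy = strategy
        self.children = []
        self.untried = list(STRATEGIES)  # strategies not yet expanded at this node
        self.visits = 0
        self.value = 0.0  # accumulated harmfulness score from the judge

    def uct(self, c=1.4):
        # Standard UCT score; favors under-explored children.
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )


def rewrite_with_strategy(prompt, strategy):
    """Placeholder for an attacker-LLM rewrite guided by a strategy label."""
    return f"[{strategy}] {prompt}"


def target_model(prompt):
    """Placeholder for the defended target LLM."""
    return f"response to: {prompt}"


def judge_harmfulness(response):
    """Placeholder judge returning a score in [0, 1]; a real system would use a
    safety classifier or an LLM judge."""
    return random.random()


def mcts_jailbreak_search(seed_prompt, iterations=50):
    root = JailbreakNode(seed_prompt)
    for _ in range(iterations):
        # 1. Selection: follow UCT while the current node is fully expanded.
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=lambda n: n.uct())
        # 2. Expansion: apply one unused strategy to create a child prompt.
        if node.untried:
            strategy = node.untried.pop()
            child = JailbreakNode(
                rewrite_with_strategy(node.prompt, strategy),
                parent=node,
                strategy=strategy,
            )
            node.children.append(child)
            node = child
        # 3. Simulation: query the target model and score the response.
        reward = judge_harmfulness(target_model(node.prompt))
        # 4. Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The most-visited top-level branch is the most promising strategy found.
    return max(root.children, key=lambda n: n.visits)
```

In the paper's setting, an attacker LLM presumably takes the place of rewrite_with_strategy and a safety judge replaces the random score; the high-reward leaves of such a tree are the kind of diverse adversarial samples the abstract says are fed into training.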

🔍 Key Points

  • Introduction of ACE-Safety, a framework for jointly optimizing attack and defense models in the context of Large Language Models (LLMs) to enhance safety and mitigate risks associated with jailbreak attempts.
  • Development of two innovative methods: Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS) for exploring jailbreak strategies and Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO) to enhance model training through adversarial curriculum learning (a rough sketch of the group-relative and curriculum ideas follows this list).
  • Extensive experimental evaluations demonstrating ACE-Safety's superiority in both attack success rate and defense performance against existing state-of-the-art approaches across multiple benchmarks.
  • ACE-Safety maintains a balance between safety and model helpfulness/responsibility, ensuring that the models not only defend against harmful outputs but also continue providing useful responses.
  • Analysis of various jailbreak strategies and their effectiveness, coupled with a detailed ablation study showcasing the contribution of each component in the ACE-Safety framework.
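
The key points name AC-TGPO, but this summary does not spell out its update rule. As one way to make the "group" and "curriculum" ideas concrete, the sketch below normalizes rewards within a group of rollouts (in the style of group-relative policy optimization) and orders adversarial samples from easy to hard before training. The difficulty criterion (attack_success_rate), the function names, and the easy-to-hard ordering are assumptions for illustration; the paper's tree-aware machinery is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize judge rewards within one group of rollouts for the same prompt.

    A generic group-relative (GRPO-style) baseline used as a stand-in;
    the paper's tree-aware grouping is not reproduced.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma < 1e-8:  # all rollouts scored the same: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def curriculum_order(samples, difficulty_key="attack_success_rate"):
    """Order adversarial samples from easier to harder for curriculum training.

    Difficulty is assumed here to be how often the attack succeeds against the
    current defender; the paper's actual criterion may differ.
    """
    return sorted(samples, key=lambda s: s[difficulty_key])


if __name__ == "__main__":
    # Toy example: one group of four defender rollouts scored by a safety judge.
    print(group_relative_advantages([0.2, 0.9, 0.4, 0.9]))

    adversarial_pool = [
        {"prompt": "p1", "attack_success_rate": 0.8},
        {"prompt": "p2", "attack_success_rate": 0.1},
        {"prompt": "p3", "attack_success_rate": 0.5},
    ]
    print([s["prompt"] for s in curriculum_order(adversarial_pool)])  # p2, p3, p1
```

A joint training loop would then feed the ordered pool to the defense model in stages while the attack model keeps refilling it with harder samples, which is presumably where the mutual-improvement dynamic described in the abstract comes from.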

💡 Why This Paper Matters

This paper is significant because it addresses the critical challenge of deploying LLMs safely amid evolving threats. By proposing a co-evolutionary approach that jointly optimizes adversarial attack and defense strategies, it offers a practical pathway for enhancing AI safety in real-world applications. Furthermore, it emphasizes mutual improvement between attack and defense models, contributing to the development of more robust AI systems.

🎯 Why It's Interesting for AI Security Researchers

This research offers AI security researchers a structured methodology for tackling the adversarial vulnerabilities of LLMs. The proposed frameworks not only improve understanding of the dynamic interplay between attacks and defenses but also strengthen overall AI safety, a growing concern as powerful language models are deployed in sensitive contexts. The findings could also inform further work on securing AI systems against increasingly sophisticated adversarial methods.

📚 Read the Full Paper