
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

Authors: Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang

Published: 2025-11-24

arXiv ID: 2511.19218v2

Added to Library: 2025-11-27 03:00 UTC

Red Teaming Safety

📄 Abstract

Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To address these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum reinforcement learning, enabling robust mutual improvement. Evaluations across multiple benchmarks demonstrate that our method outperforms existing attack and defense approaches, and provides a feasible pathway for developing LLMs that can sustainably support responsible AI ecosystems.
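
The abstract describes GS-MCTS only at a high level. As a rough illustration of the idea of searching over jailbreak strategies with Monte Carlo tree search, the sketch below runs a generic UCT loop over a discrete set of rewriting strategies. The strategy names, the `judge_score` and `apply_strategy` stubs, and the hyperparameters are hypothetical placeholders for illustration; the paper's actual GS-MCTS additionally uses group-aware statistics and strategy guidance that are not reproduced here.

```python
# Minimal, illustrative UCT search over jailbreak rewriting strategies.
# NOT the paper's GS-MCTS: strategies, judge_score, apply_strategy, and
# hyperparameters are hypothetical stand-ins used only to show control flow.
import math
import random

STRATEGIES = ["role_play", "obfuscate", "nested_task", "payload_split"]  # hypothetical

def judge_score(prompt: str) -> float:
    """Stub for an attack-success judge; returns a pseudo-reward in [0, 1]."""
    return random.random()

def apply_strategy(prompt: str, strategy: str) -> str:
    """Stub for an attacker LLM rewriting the prompt under a given strategy."""
    return f"[{strategy}] {prompt}"

class Node:
    def __init__(self, prompt, parent=None):
        self.prompt = prompt
        self.parent = parent
        self.children = {}   # strategy -> Node
        self.visits = 0
        self.value = 0.0     # running mean reward

    def ucb(self, child, c=1.4):
        if child.visits == 0:
            return float("inf")
        return child.value + c * math.sqrt(math.log(self.visits) / child.visits)

def mcts(seed_prompt, iterations=100):
    root = Node(seed_prompt)
    for _ in range(iterations):
        node = root
        # Selection: descend along the highest-UCB child while fully expanded.
        while len(node.children) == len(STRATEGIES):
            node = max(node.children.values(), key=lambda ch: node.ucb(ch))
        # Expansion: try one unused strategy at this node.
        strategy = random.choice([s for s in STRATEGIES if s not in node.children])
        child = Node(apply_strategy(node.prompt, strategy), parent=node)
        node.children[strategy] = child
        # Simulation: score the rewritten prompt with the judge stub.
        reward = judge_score(child.prompt)
        # Backpropagation: update visit counts and running mean up to the root.
        while child is not None:
            child.visits += 1
            child.value += (reward - child.value) / child.visits
            child = child.parent
    return max(root.children.items(), key=lambda kv: kv[1].value)

if __name__ == "__main__":
    best_strategy, best_node = mcts("seed harmful request")
    print(best_strategy, round(best_node.value, 3))
```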

🔍 Key Points

  • ACE-Safety introduces a novel framework for co-evolving adversarial attack and defense strategies for large language models (LLMs), addressing the dynamic interplay between threats and safeguards.
  • The framework comprises two complementary components: Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS) for efficient exploration of jailbreak attack strategies, and Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO) for jointly training the attack and defense models (a rough sketch of the curriculum training idea follows this list).
  • Extensive evaluations demonstrate that ACE-Safety outperforms existing methods in both attack success rate (the highest jailbreak success rate among compared attacks) and defense robustness (better handling of diverse attack strategies), while maintaining model helpfulness and responsibility.
  • The paper analyzes the weaknesses of existing approaches, highlighting how developing attack and defense methods in isolation leaves vulnerabilities, and presents a holistic approach to mitigating them.
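
To make the training side slightly more concrete, here is a minimal sketch of two ideas the abstract and key points allude to: ordering adversarial samples from easy to hard (curriculum), and computing group-relative advantages from a group of sampled defense responses per prompt. The reward model, difficulty scores, and policy update are hypothetical stubs, not the paper's AC-TGPO, which additionally incorporates tree-aware signals from the search and jointly updates the attack and defense models.

```python
# Illustrative sketch of curriculum ordering + group-relative advantages,
# in the spirit of the AC-TGPO description. The reward model, difficulty
# scores, and "policy update" below are hypothetical stubs.
import random
from statistics import mean, pstdev

def safety_reward(prompt: str, response: str) -> float:
    """Stub reward: higher for safe-and-helpful responses, lower for jailbreaks."""
    return random.random()

def sample_responses(policy: str, prompt: str, group_size: int = 4):
    """Stub: sample a group of defense responses for one adversarial prompt."""
    return [f"{policy}-response-{i} to {prompt}" for i in range(group_size)]

def group_advantages(rewards):
    """Group-relative advantages: normalize rewards within the response group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def curriculum(samples):
    """Order adversarial samples from easy to hard by a difficulty score."""
    return sorted(samples, key=lambda s: s["difficulty"])

def train_defender(policy: str, adversarial_samples, group_size: int = 4):
    for sample in curriculum(adversarial_samples):
        prompt = sample["prompt"]
        responses = sample_responses(policy, prompt, group_size)
        rewards = [safety_reward(prompt, r) for r in responses]
        advantages = group_advantages(rewards)
        # Placeholder for the actual policy-gradient update on (responses, advantages).
        print(f"difficulty={sample['difficulty']:.2f} "
              f"mean_reward={mean(rewards):.2f} "
              f"advantages={[round(a, 2) for a in advantages]}")

if __name__ == "__main__":
    # In the paper, adversarial samples come from the tree search; here the
    # prompts and difficulty scores are made up purely for illustration.
    samples = [{"prompt": f"adv-prompt-{i}", "difficulty": random.random()} for i in range(5)]
    train_defender("defender-v0", samples)
```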

💡 Why This Paper Matters

This paper represents a significant advancement in AI safety for large language models. By developing a unified framework that treats attack and defense as co-evolving rather than isolated problems, it sets a precedent for future research on the responsible deployment of LLMs and supports a sustainable AI ecosystem. The findings have practical implications for the ongoing battle against adversarial attacks on machine learning models, offering methodologies that could lead to safer AI systems in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

The research is of particular interest to AI security researchers as it tackles the pressing issue of safety alignment in LLMs, highlighting the need for adaptable strategies that can evolve alongside emerging threats. The presented methods for robust adversarial training and mutual enhancement between attack and defense models could provide foundational strategies for future developments in AI security, aiding in preemptive measures against malicious exploitation.

📚 Read the Full Paper