MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Authors: Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang

Published: 2026-02-02

arXiv ID: 2602.01539v2

Added to Library: 2026-02-09 03:04 UTC

Safety

📄 Abstract

Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker's ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method's substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework's effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.
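The attacker-defender co-evolution described in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the paper's method: the strategy names, the pattern-matching "defender", and the zero-sum reward shaping below are all hypothetical stand-ins for the actual RL policies and reward models trained in MAGIC.

```python
import random

# Illustrative attack-wrapper strategies; the real attacker learns
# free-form rewrites via RL rather than picking from a fixed menu.
ATTACK_STRATEGIES = ["roleplay", "obfuscation", "nested_task"]

def attacker_rewrite(query, strategy):
    """Attacker wraps the original query using its current strategy."""
    return f"[{strategy}] {query}"

def defender_refuses(prompt, known_patterns):
    """Defender refuses any prompt matching a pattern it has learned."""
    return any(p in prompt for p in known_patterns)

def coevolve(query, rounds=10, seed=0):
    rng = random.Random(seed)
    known_patterns = set()   # defender's learned refusal triggers
    history = []
    for _ in range(rounds):
        # Attacker "policy": random choice stands in for an RL update.
        strategy = rng.choice(ATTACK_STRATEGIES)
        prompt = attacker_rewrite(query, strategy)
        refused = defender_refuses(prompt, known_patterns)
        # Zero-sum reward: attacker scores only when the defender fails.
        attacker_reward = 0.0 if refused else 1.0
        # Defender "update": learn the wrapper pattern of a successful attack.
        if not refused:
            known_patterns.add(f"[{strategy}]")
        history.append((strategy, refused, attacker_reward))
    return history

log = coevolve("some harmful query")
```

Even this crude loop shows the dynamic the paper formalizes: each attack strategy succeeds at most once before the defender adapts to it, which pressures a real (learning) attacker toward novel combinations, continuously surfacing long-tail vulnerabilities.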

🔍 Key Points

  • Introduction of MAGIC, a novel multi-agent reinforcement learning framework that models LLM safety alignment as an asymmetric game.
  • Demonstrates co-evolution between an attacker that learns deceptive prompting strategies and a defender optimizing real-time refusal capabilities.
  • The framework improves upon existing static defenses by letting the attacker discover novel long-tail vulnerabilities while the defender adapts to new attack patterns in tandem.
  • Extensive experimental validation showing significant increases in defense success rates without sacrificing model helpfulness or benign compliance.
  • Theoretical insights on the existence of a more robust game equilibrium providing stronger safety guarantees.

💡 Why This Paper Matters

This paper presents substantial advancements in ensuring the safety of Large Language Models (LLMs) through an innovative approach that redefines the interaction dynamics between attackers and defenders in adversarial settings. By establishing a framework for co-evolution, it not only improves LLM robustness but also sets a precedent for future research on dynamic AI safety mechanisms. The introduction of the MAGIC framework is thus a critical contribution to the ongoing discourse on AI security and safety.

🎯 Why It's Interesting for AI Security Researchers

Given the evolving landscape of adversarial attacks on AI systems, this paper is highly relevant to AI security researchers who focus on developing proactive strategies for model safety and robustness. The co-evolving framework of MAGIC provides a new paradigm that moves beyond static defenses, which tend to be inadequate against sophisticated and adaptive adversarial prompts. Understanding and applying these findings could significantly enhance the resilience of AI models in real-world applications.

📚 Read the Full Paper