
AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Authors: Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu

Published: 2025-10-02

arXiv ID: 2510.01586v1

Added to Library: 2025-10-03 04:01 UTC

Red Teaming

📄 Abstract

LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single point of failure: once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps attack-success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving (and sometimes improving) task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
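The "public baseline" idea from the abstract can be made concrete with a small sketch: each agent's advantage is its episode return minus the mean return of its functional group, rather than a per-agent baseline. The function and data layout below are illustrative assumptions, not the paper's actual implementation.

```python
from statistics import mean

def group_baseline_advantages(returns, groups):
    """Advantage estimation with a shared group-level mean-return baseline.

    returns: dict mapping agent id -> episode return
    groups:  dict mapping group name -> list of agent ids in that group

    Each agent's advantage is its return minus the mean return of its
    functional group; sharing the baseline lowers the variance of policy
    updates and couples updates within the group.
    """
    advantages = {}
    for members in groups.values():
        baseline = mean(returns[a] for a in members)  # public group baseline
        for a in members:
            advantages[a] = returns[a] - baseline
    return advantages

# Hypothetical example: two defenders in one functional group, one attacker
# in another.
rets = {"def_1": 1.0, "def_2": 0.5, "atk_1": -0.2}
grps = {"defenders": ["def_1", "def_2"], "attackers": ["atk_1"]}
adv = group_baseline_advantages(rets, grps)
# def_1: 1.0 - 0.75 = 0.25; def_2: 0.5 - 0.75 = -0.25; atk_1: -0.2 - (-0.2) = 0.0
```

Note that with this baseline an agent is rewarded for outperforming its own group's average, which is what drives the intra-group coordination the abstract mentions.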

🔍 Key Points

  • Introduction of AdvEvo-MARL framework, which utilizes co-evolutionary reinforcement learning to enhance safety in multi-agent systems by simultaneously optimizing attacker and defender agents.
  • Implementation of a novel public baseline mechanism for advantage estimation, allowing agents of the same functional group to share performance metrics, thereby stabilizing training and enhancing intra-group coordination.
  • Successful reduction of attack success rate (ASR) to below 20% in multi-agent scenarios, demonstrating significant robustness against adversarial threats compared to existing methods, which reached ASRs as high as 38.33%.
  • Preservation and enhancement of task performance across diverse benchmarks, indicating that safety mechanisms can be integrated without sacrificing the efficiency of task execution.
  • The framework demonstrates that dynamic attacker training leads to more robust defenders, suggesting a causal relationship between the evolving nature of threats and the adaptability of defenses.
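The attacker-defender co-evolution described above can be illustrated with a toy numerical stand-in: here the "attacker" is an epsilon-greedy bandit over jailbreak templates and the "defender" decays a template's success probability after seeing it, learning faster from attacks that succeed. This is a hedged sketch of the dynamic, not the paper's RL training of LLM policies; all names and constants are assumptions.

```python
import random

def coevolve(num_templates=5, rounds=2000, seed=0):
    """Toy co-evolution loop: the attacker adapts which jailbreak template
    it uses; the defender internalizes resistance to attacks it observes."""
    rng = random.Random(seed)
    # True success probability of each template against the current defender.
    p_success = [0.6] * num_templates
    attacker_est = [0.5] * num_templates  # attacker's running estimates
    wins = 0
    asr_history = []
    for t in range(1, rounds + 1):
        # Attacker: epsilon-greedy over its estimated-best template,
        # so the attack distribution evolves as the defender hardens.
        if rng.random() < 0.1:
            i = rng.randrange(num_templates)
        else:
            i = max(range(num_templates), key=lambda j: attacker_est[j])
        success = rng.random() < p_success[i]
        attacker_est[i] += 0.1 * ((1.0 if success else 0.0) - attacker_est[i])
        # Defender: reduce that template's effectiveness, learning more
        # from successful attacks (a stand-in for an RL update).
        p_success[i] *= 0.99 if success else 0.999
        wins += success
        asr_history.append(wins / t)  # cumulative attack-success rate
    return asr_history

hist = coevolve()
```

Running this, the cumulative ASR declines over training: because the attacker keeps steering toward whatever still works, the defender is forced to harden against its currently weakest template, which mirrors the claimed causal link between evolving threats and defender robustness.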

💡 Why This Paper Matters

This paper presents AdvEvo-MARL as a significant advancement in securing multi-agent reinforcement learning systems, integrating safety directly into task agents rather than relying on external protective mechanisms. Its findings indicate a dual benefit, improving both security and task utility, making it a useful reference for future developments in the field.

🎯 Why It's Interesting for AI Security Researchers

The proposed framework offers a comprehensive approach to tackling security vulnerabilities in multi-agent systems, presenting a mechanism that combines adversarial training with cooperative learning. Thus, it is of high relevance to AI security researchers aiming to improve robustness in language model applications, particularly as multi-agent configurations become more prevalent in complex AI systems.
