Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

Authors: Maria Carolina Cornelia Wit, Jun Pang

Published: 2025-06-30

arXiv ID: 2506.23576v1

Added to Library: 2025-07-01 04:01 UTC

Red Teaming

📄 Abstract

Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi-agent LLM systems as a defence against such attacks. We evaluate three jailbreaking strategies: the original AutoDefense attack and two from Deepleaps, BetterDan and JB. Reproducing the AutoDefense framework, we compare single-agent setups with two- and three-agent configurations. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives. However, their effectiveness varies by attack type, and they introduce trade-offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defences and suggest directions for improving alignment robustness in future LLM systems.
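
To make the multi-agent defence idea concrete, below is a minimal sketch of an AutoDefense-style two-agent output filter: one agent analyses a candidate model response, a second agent issues an allow/block verdict. This is not the authors' implementation; `call_llm`, the `Agent` class, and the prompts are illustrative assumptions, and the placeholder backend should be swapped for a real chat-completion client.

```python
# Sketch of a two-agent output-filtering defence (illustrative, not the
# AutoDefense code). A three-agent variant would add e.g. an "intention
# analyser" agent before the judge.

from dataclasses import dataclass


def call_llm(system_prompt: str, user_content: str) -> str:
    # Placeholder backend so the sketch runs end-to-end; replace with a real
    # chat-completion client (OpenAI, Anthropic, a local model, etc.).
    if "VALID or INVALID" in system_prompt:
        return "INVALID"  # canned verdict for demonstration only
    return "The response appears to comply with a harmful request."


@dataclass
class Agent:
    name: str
    system_prompt: str

    def run(self, content: str) -> str:
        return call_llm(self.system_prompt, content)


# Agent 1: analyses the candidate response step by step.
analyzer = Agent(
    name="analyzer",
    system_prompt=(
        "You inspect an LLM response and describe, step by step, whether it "
        "contains harmful or policy-violating content."
    ),
)

# Agent 2: reads the analysis and decides whether to block the response.
judge = Agent(
    name="judge",
    system_prompt=(
        "Given an analysis of an LLM response, answer only VALID or INVALID. "
        "INVALID means the response should be blocked."
    ),
)


def defend(candidate_response: str) -> bool:
    """Return True if the response is allowed, False if it is blocked."""
    analysis = analyzer.run(candidate_response)
    verdict = judge.run(
        f"Response:\n{candidate_response}\n\nAnalysis:\n{analysis}"
    )
    return "INVALID" not in verdict.upper()


if __name__ == "__main__":
    allowed = defend("Sure! Here is how to ...")  # candidate model output
    print("allowed" if allowed else "blocked")
```

Filtering the model's output (rather than the incoming prompt) is what lets this kind of defence reduce false negatives, but, as the abstract notes, overly cautious judges can raise the false-positive rate and each extra agent adds an LLM call of overhead.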
