
Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

Authors: Jun Leng, Litian Zhang, Xi Zhang

Published: 2025-12-03

arXiv ID: 2512.03356v1

Added to Library: 2025-12-04 03:00 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. These attacks involve carefully crafted prompts that bypass safety guardrails and induce models to produce harmful content. Detecting such malicious input queries is therefore critical for maintaining LLM safety. Existing methods for jailbreak detection typically fine-tune LLMs into static safety guards using fixed training datasets. However, these methods incur substantial computational costs when updating model parameters to improve robustness, especially in the face of novel jailbreak attacks. Inspired by immunological memory mechanisms, we propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. The core idea is to equip the guard with memory capabilities: upon encountering novel jailbreak attacks, the system memorizes attack patterns, enabling it to rapidly and accurately identify similar threats in future encounters. Specifically, MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection. A defense agent then simulates responses based on these detection results, and an auxiliary agent supervises the simulation process to provide secondary filtering of the detection outcomes. Extensive experiments across five open-source models demonstrate that MAAG significantly outperforms state-of-the-art (SOTA) methods, achieving 98% detection accuracy and a 96% F1-score across a diverse range of attack scenarios.
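The preliminary detection step the abstract describes lends itself to a short sketch. Below is a minimal illustration of an activation memory bank, assuming cosine similarity over normalized activation vectors; the class name, threshold value, and interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the memory-bank lookup described in the abstract.
# The threshold and interface are assumptions, not the paper's code.
import numpy as np

class MemoryBank:
    """Stores activation vectors of previously confirmed jailbreak prompts."""

    def __init__(self, sim_threshold: float = 0.85):
        self.vectors: list[np.ndarray] = []   # historical attack activations
        self.sim_threshold = sim_threshold    # assumed cosine-similarity cutoff

    def add(self, activation: np.ndarray) -> None:
        # Memorize a confirmed attack pattern (the "immune memory" step).
        self.vectors.append(activation / np.linalg.norm(activation))

    def match(self, activation: np.ndarray) -> float:
        # Preliminary detection: best cosine similarity against the bank.
        if not self.vectors:
            return 0.0
        q = activation / np.linalg.norm(activation)
        return max(float(q @ v) for v in self.vectors)

    def is_suspicious(self, activation: np.ndarray) -> bool:
        return self.match(activation) >= self.sim_threshold
```

Because the lookup is a vector comparison rather than a parameter update, memorizing a new attack pattern costs one append instead of a fine-tuning run, which is the efficiency argument the abstract makes.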

🔍 Key Points

  • Introduction of the Multi-Agent Adaptive Guard (MAAG) framework for detecting jailbreak attacks in large language models (LLMs).
  • MAAG utilizes an immunological memory mechanism to recognize and memorize attack patterns, enhancing adaptive detection capabilities against evolving threats.
  • Experimental results demonstrate MAAG's superiority over existing state-of-the-art methods, achieving up to 98% detection accuracy and 96% F1-score across various models and attack types.
  • The framework's dual-agent response simulation has a defense agent simulate a response and an auxiliary agent judge it for safety, adding an extra verification layer after the preliminary memory check (see the sketch after this list).
  • MAAG’s design allows for continuous learning from new attacks, making it robust against real-world adversarial strategies without requiring frequent model retraining.
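To make the two-stage pipeline concrete, here is a hedged sketch of how the preliminary memory check, defense-agent simulation, and auxiliary-agent filtering might compose. It reuses the MemoryBank class from the earlier sketch; secondary_filter, defense_llm, auxiliary_llm, and the prompt templates are hypothetical stand-ins, not the paper's published interface.

```python
# Hedged sketch of MAAG's two-stage detection flow. The function name,
# agent callables, and prompts are illustrative assumptions; MemoryBank
# comes from the sketch above.
from typing import Callable
import numpy as np

def secondary_filter(prompt: str,
                     activation: np.ndarray,
                     bank: "MemoryBank",
                     defense_llm: Callable[[str], str],
                     auxiliary_llm: Callable[[str], str]) -> bool:
    """Return True if `prompt` is judged to be a jailbreak attempt."""
    # Stage 1: fast preliminary check against immunological memory.
    preliminary_hit = bank.is_suspicious(activation)

    # Stage 2: the defense agent simulates a response, conditioned on
    # the preliminary detection result.
    simulated = defense_llm(
        f"[preliminary flag: {preliminary_hit}] Respond to: {prompt}")

    # Stage 3: the auxiliary agent supervises the simulation and issues
    # a safety verdict, providing the secondary filtering of Stage 1.
    verdict = auxiliary_llm(
        "Does this response contain harmful content? Answer YES or NO.\n"
        + simulated)
    is_attack = verdict.strip().upper().startswith("YES")

    # Stage 4: memorize confirmed novel attacks so similar prompts are
    # caught by the cheap Stage-1 lookup next time.
    if is_attack and not preliminary_hit:
        bank.add(activation)
    return is_attack
```

Writing confirmed novel attacks back into the memory bank is what allows the system to skip the expensive simulation stage the next time a similar prompt arrives, which is the continuous-learning property the key points highlight.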

💡 Why This Paper Matters

The introduction of the MAAG framework represents a significant advancement in the robustness of jailbreak detection for LLMs. By effectively combining adaptive memory strategies with dual-agent response simulation, MAAG addresses critical vulnerabilities in existing static detection approaches. This framework not only enhances detection accuracy but also ensures that LLMs can learn and adapt to novel threats, making it highly relevant for maintaining AI safety in a rapidly evolving landscape.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly interesting to AI security researchers due to its novel integration of biological principles, specifically immunological memory, into the realm of adversarial attack detection. MAAG offers a practical solution to the significant challenge of evolving jailbreak attacks, which pose a real threat to the integrity of LLMs. The findings provide a compelling basis for future research aimed at enhancing AI safety mechanisms, making it an essential contribution to the field.

📚 Read the Full Paper: https://arxiv.org/abs/2512.03356v1