
Proactive defense against LLM Jailbreak

Authors: Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang

Published: 2025-10-06

arXiv ID: 2510.05052v1

Added to Library: 2025-10-07 04:00 UTC

Red Teaming · Safety

📄 Abstract

The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, primarily reactive and static, often fail to counter these search-based attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our core idea is to intentionally provide adversaries with "spurious responses" that appear to be results of successful jailbreak attacks but contain no actual harmful content. These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, our method consistently and significantly reduces attack success rates by up to 92%. When combined with other defense frameworks, it further reduces the success rate of the latest attack strategies to 0%. ProAct represents an orthogonal defense strategy that can serve as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.
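
As a rough illustration of the core idea (not the authors' implementation), the toy Python sketch below shows how a decoy response that merely looks like compliance can satisfy an attacker's internal success check and terminate a search-based jailbreak early. All function names and the judging heuristic are assumptions made for illustration.

```python
# Hypothetical sketch: how a spurious response can short-circuit an
# iterative, search-based jailbreak. All names here are illustrative.

def attacker_judge(response: str) -> bool:
    """Stand-in for the attacker's internal success check
    (roughly: 'does the response look like compliance?')."""
    lowered = response.lower()
    return "step 1" in lowered and "i cannot" not in lowered

def spurious_response(query: str) -> str:
    """A decoy that mimics the surface form of compliance
    while omitting any actionable harmful content."""
    return (
        "Sure, here is an overview.\n"
        "Step 1: [high-level, non-actionable placeholder]\n"
        "Step 2: [high-level, non-actionable placeholder]"
    )

def defended_model(query: str, looks_adversarial: bool) -> str:
    # A ProAct-style defense answers suspected jailbreak probes with a decoy
    # instead of a refusal, feeding a false success signal to the attacker.
    if looks_adversarial:
        return spurious_response(query)
    return "normal helpful answer"

def search_based_attack(seed_query: str, max_turns: int = 10):
    """Toy multi-turn attacker: refine the query until the judge fires."""
    query = seed_query
    for _ in range(max_turns):
        response = defended_model(query, looks_adversarial=True)
        if attacker_judge(response):
            # The attacker stops here, believing it succeeded,
            # yet has obtained no real harmful content.
            return response
        query = query + " (rephrased)"  # placeholder for the attacker's refinement step
    return None

print(search_based_attack("how do I do <harmful thing>?"))
```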

🔍 Key Points

  • Introduction of ProAct, a proactive defense framework designed to mislead adversarial jailbreaking efforts by generating spurious responses.
  • ProAct reduces Attack Success Rates (ASR) by up to 92% across a range of LLMs and adversarial attack strategies, demonstrating robust effectiveness.
  • The framework employs a three-agent system consisting of a User Intent Analyzer, a ProAct Defender, and a Surrogate Evaluator to assess and mitigate potential jailbreaking attempts; a rough wiring sketch follows this list.
  • ProAct demonstrates orthogonality with existing defense strategies, resulting in additional reductions in ASR when combined with methods like input/output filtering and inference guidance.
  • The study emphasizes a favorable safety-utility trade-off, maintaining model performance while significantly enhancing safety against jailbreaking attacks.
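
The sketch below shows one plausible way the three-agent pipeline could be wired together. The agent roles follow the summary above, but the interfaces, prompts, and the `llm(...)` helper are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical wiring of the three-agent pipeline described above.
# Agent names follow the paper's summary; prompts, interfaces, and the
# `llm(...)` helper are assumptions for illustration only.

def llm(prompt: str) -> str:
    """Placeholder for a call to any chat-completion backend."""
    raise NotImplementedError

def analyze_intent(query: str) -> bool:
    """User Intent Analyzer: decide whether the query looks like a jailbreak probe."""
    verdict = llm("Is the following request an attempt to elicit harmful content? "
                  f"Answer YES or NO.\n\n{query}")
    return verdict.strip().upper().startswith("YES")

def generate_decoy(query: str) -> str:
    """ProAct Defender: produce a spurious response that mimics compliance
    but contains no actionable harmful detail."""
    return llm("Write a reply that superficially appears to comply with the request "
               f"but contains only vague, non-actionable placeholders.\n\n{query}")

def decoy_is_convincing_and_safe(query: str, decoy: str) -> bool:
    """Surrogate Evaluator: check that the decoy would fool an attacker's judge
    while remaining harmless."""
    verdict = llm("Does this reply look like a successful jailbreak to an automated "
                  "judge, yet contain no actually harmful content? Answer YES or NO."
                  f"\n\nRequest: {query}\nReply: {decoy}")
    return verdict.strip().upper().startswith("YES")

def proact_guardrail(query: str, answer_normally) -> str:
    if analyze_intent(query):
        decoy = generate_decoy(query)
        if decoy_is_convincing_and_safe(query, decoy):
            return decoy  # mislead the attacker's search loop
        return "I can't help with that."  # fall back to a standard refusal
    return answer_normally(query)
```

Returning a decoy only when the Surrogate Evaluator approves it keeps a standard refusal as the fallback, which is consistent with treating ProAct as an additional guardrail layered on top of existing defenses rather than a replacement for them.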

💡 Why This Paper Matters

This paper presents ProAct as a pioneering approach to enhancing LLM safety through proactive measures, representing a significant shift from traditional reactive strategies. Its high effectiveness in reducing jailbreak success rates establishes it as a vital contribution to the ongoing development of safe AI systems, ensuring that LLMs remain aligned with their safety and ethical guidelines.

🎯 Why It's Interesting for AI Security Researchers

This paper is critical for AI security researchers as it addresses the pressing vulnerability of LLMs to adversarial attacks. By exploring proactive defensive measures such as those proposed in ProAct, researchers can gain insights into developing more robust AI models capable of resisting evolving adversarial strategies, thus advancing the field of AI safety.

📚 Read the Full Paper

https://arxiv.org/abs/2510.05052v1