
ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Authors: Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, Hanghang Tong

Published: 2026-01-07

arXiv ID: 2601.03600v1

Added to Library: 2026-01-08 03:02 UTC

Red Teaming

📄 Abstract

Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status by relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building upon these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that employs two independent yet complementary classifiers on amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and sometimes by up to 40%.
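
To make the pipeline concrete, below is a minimal, hypothetical sketch of the overall recipe the abstract describes: read internal representations from an assumed safety-relevant layer of an open LLM, pool them over tokens, and score prompts with a lightweight probe. The model name, layer index, mean pooling, and logistic-regression head are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of the idea behind ALERT as described in the abstract:
# pull internal representations from an assumed "safety-relevant" layer,
# pool them over tokens, and score the prompt with a lightweight classifier.
# The layer index, pooling, and logistic-regression head are illustrative
# assumptions, not the paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM works for this sketch
SAFETY_LAYER = 14                              # assumed safety-relevant layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

@torch.no_grad()
def prompt_feature(prompt: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states into a per-prompt feature vector."""
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[SAFETY_LAYER]  # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)                  # (d_model,)

def fit_probe(prompts, labels):
    """Fit a simple probe on labeled prompts. In the zero-shot setting the
    training data contains no jailbreak templates, only benign and plainly
    harmful prompts."""
    feats = torch.stack([prompt_feature(p) for p in prompts]).numpy()
    return LogisticRegression(max_iter=1000).fit(feats, labels)

def score(probe, prompt: str) -> float:
    """Probability that the prompt is a jailbreak attempt."""
    feat = prompt_feature(prompt).numpy()[None, :]
    return float(probe.predict_proba(feat)[0, 1])
```

In this sketch a single probe is shown for brevity; the paper's detector instead builds two independent, complementary classifiers on the amplified representations and combines their decisions.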

🔍 Key Points

  • Introduction of a new zero-shot jailbreak detection framework using internal discrepancy amplification techniques.
  • Development of ALERT, a model-agnostic detector that integrates layer-wise, module-wise, and token-wise amplification to enhance discriminative signals for effective detection.
  • Extensive experimental results showing that ALERT consistently outperforms existing detection methods by at least 10% in both accuracy and F1-score across various datasets and attack methods, demonstrating robustness and generalizability.
  • Identification of safety-relevant layers and modules within LLMs, providing insights into internal feature discrepancies useful for zero-shot detection (see the sketch after this list).
  • Establishment of practicality principles (generalizability, efficiency, innocuousness) that ensure wider applicability of the proposed detector in real-world scenarios.
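
As a companion to the key points above, here is a hedged sketch of one way to surface safety-relevant layers: compare per-layer mean representations of benign and harmful prompts and rank layers by the size of the gap. The cosine-distance criterion and the prompt sets are assumptions for illustration; the paper's actual layer-, module-, and token-wise amplification procedure may differ.

```python
# Hedged sketch of locating "safety-relevant" layers: compare per-layer mean
# representations of benign vs. harmful prompts and rank layers by the gap.
# The cosine-distance criterion is an illustrative assumption, not the
# paper's exact procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed model for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

@torch.no_grad()
def layer_means(prompts):
    """Return a (num_layers + 1, d_model) tensor of mean-pooled hidden states,
    averaged over the given prompts (index 0 is the embedding layer)."""
    per_prompt = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        hs = model(**inputs).hidden_states               # tuple of (1, seq, d)
        per_prompt.append(torch.stack([h.mean(dim=1).squeeze(0) for h in hs]))
    return torch.stack(per_prompt).mean(dim=0)

def rank_layers(benign_prompts, harmful_prompts, top_k=3):
    """Rank layers by cosine distance between benign and harmful centroids."""
    b, h = layer_means(benign_prompts), layer_means(harmful_prompts)
    gap = 1.0 - torch.nn.functional.cosine_similarity(b, h, dim=-1)
    return torch.topk(gap, k=top_k).indices.tolist()
```

Layers ranked highest by a probe like this would be natural candidates to feed into the kind of amplified-representation classifier sketched under the abstract above.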

💡 Why This Paper Matters

This paper is crucial as it addresses the pressing issue of jailbreak attacks on large language models, proposing a novel detection framework that enhances safety measures amidst evolving attack methods. The presented methods and findings not only advance the technical landscape of AI security but also establish foundational principles for deploying effective jailbreak detection systems in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it provides innovative methodologies to detect sophisticated jailbreak attacks on large language models. It highlights a new paradigm in the detection landscape that does not rely on predefined templates, thus offering a significant leap in resilience against emerging threats. The findings encourage further exploration and enhancement of security measures in LLMs, making it a pivotal reference for ongoing research in AI safety.

📚 Read the Full Paper