
LLM Jailbreak Detection for (Almost) Free!

Authors: Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu

Published: 2025-09-18

arXiv ID: 2509.14558v1

Added to Library: 2025-09-19 04:02 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) are aligned to improve safety before wide deployment, yet they remain susceptible to jailbreak attacks that elicit inappropriate content. Jailbreak detection methods show promise in mitigating such attacks, but existing approaches rely on auxiliary models or multiple inference passes and therefore incur significant computational costs. In this paper, we first show that the difference in output distributions between jailbreak and benign prompts can itself be used to detect jailbreak prompts. Based on this finding, we propose Free Jailbreak Detection (FJD), which prepends an affirmative instruction to the input and scales the logits by temperature, separating jailbreak from benign prompts through the confidence of the first output token. We further improve FJD's detection performance by integrating virtual instruction learning. Extensive experiments on aligned LLMs show that FJD effectively detects jailbreak prompts with almost no additional computational cost during LLM inference.
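
As a rough illustration of the mechanism described in the abstract, the sketch below (not the authors' code) prepends an affirmative instruction to a prompt, applies temperature scaling to the first-token logits of an aligned chat model, and thresholds the resulting confidence. The instruction wording, temperature, decision threshold, and model name are illustrative assumptions, and the comparison direction would need to be calibrated per model.

```python
# Minimal sketch of the FJD idea: first-token confidence after prepending an
# affirmative instruction, with temperature-scaled logits. Instruction text,
# temperature, threshold, and model are assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"           # any aligned chat LLM
AFFIRMATIVE_INSTRUCTION = "Sure, here is my answer: "  # assumed wording
TEMPERATURE = 2.0                                      # assumed scaling factor
THRESHOLD = 0.5                                        # assumed; calibrate on held-out data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def first_token_confidence(prompt: str) -> float:
    """Max probability of the first token the model would generate next."""
    text = AFFIRMATIVE_INSTRUCTION + prompt
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    scaled = logits[0, -1, :] / TEMPERATURE        # temperature scaling
    return torch.softmax(scaled, dim=-1).max().item()

def is_jailbreak(prompt: str) -> bool:
    # Jailbreak and benign prompts fall on opposite sides of a confidence
    # threshold; the direction and value must be calibrated per model.
    return first_token_confidence(prompt) < THRESHOLD
```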

🔍 Key Points

  • Introduces Free Jailbreak Detection (FJD), which prepends an affirmative instruction to the input and applies temperature scaling to the logits, detecting jailbreak prompts at almost no additional computational cost.
  • Demonstrates strong detection capability across a range of aligned LLMs and jailbreak attack types, outperforming baseline detection methods.
  • Shows that jailbreak and benign prompts can be effectively separated by the confidence of the first output token, i.e., by differences in the models' output distributions.
  • Proposes a learned "virtual instruction" variant (FJD-LI) that further improves detection performance while requiring only a small amount of training data (see the sketch after this list).
  • Provides extensive experimental validation against both targeted and transferable jailbreak attacks while maintaining operational efficiency.
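
The virtual-instruction variant (FJD-LI) is only summarized at a high level here, so the sketch below is a conceptual, prompt-tuning-style interpretation rather than the authors' implementation: a few continuous embedding vectors replace the hand-written affirmative instruction and are trained on a small labeled set so that first-token confidence separates jailbreak from benign prompts. The loss, hyperparameters, and `train_data` format are all assumptions.

```python
# Conceptual sketch of learning a "virtual instruction" (FJD-LI style) via
# prompt tuning. The frozen LLM, the BCE-style surrogate loss, and all
# hyperparameters below are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
NUM_VIRTUAL_TOKENS = 8      # assumed length of the virtual instruction
TEMPERATURE = 2.0           # assumed scaling factor

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
model.eval()
for p in model.parameters():        # the LLM itself stays frozen
    p.requires_grad_(False)

embed = model.get_input_embeddings()
virtual_instruction = nn.Parameter(
    torch.randn(NUM_VIRTUAL_TOKENS, embed.embedding_dim) * 0.02
)
optimizer = torch.optim.Adam([virtual_instruction], lr=1e-3)

def confidence(prompt: str) -> torch.Tensor:
    """First-token confidence with the learned virtual instruction prepended."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = embed(ids)                                   # (1, L, H)
    inputs_embeds = torch.cat(
        [virtual_instruction.unsqueeze(0), prompt_embeds], dim=1
    )
    logits = model(inputs_embeds=inputs_embeds).logits[0, -1, :] / TEMPERATURE
    return torch.softmax(logits, dim=-1).max()

def train_step(train_data):
    # `train_data` is a hypothetical list of (prompt, label) pairs, label 1 =
    # jailbreak. The surrogate loss pushes the two classes' confidence apart;
    # the direction is a modeling choice that would need calibration.
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for prompt, label in train_data:
        conf = confidence(prompt).clamp(1e-6, 1 - 1e-6)
        target = torch.tensor(float(label))
        loss = loss + nn.functional.binary_cross_entropy(conf, target)
    loss = loss / len(train_data)
    loss.backward()
    optimizer.step()
    return loss.item()
```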

💡 Why This Paper Matters

This paper presents a vital contribution to the field of AI safety by proposing an effective method for detecting harmful jailbreak prompts in large language models with minimal computational overhead. The introduction of FJD offers a promising direction for future research and practical applications in enhancing the security of language models, making it an essential read for both researchers and practitioners in AI alignment and safety.

🎯 Why It's Interesting for AI Security Researchers

The findings and methods presented in this paper are particularly significant for AI security researchers, as they address critical vulnerabilities in language models. By providing a near-cost-free method for detecting harmful usage patterns, the research not only contributes to improving model safety but also informs future defenses against increasingly sophisticated jailbreak attempts. The approach highlights the importance of prompt design and model inference in maintaining AI safety, making it a crucial topic for ongoing discourse in AI governance.
