NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

Authors: John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra

Published: 2025-10-02

arXiv ID: 2510.01644v1

Added to Library: 2025-10-03 04:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.

🔍 Key Points

The study develops a fine-tuned BERT model to detect jailbreak prompts in large language models (LLMs), achieving significantly improved accuracy over traditional methods.
It introduces a classification system that distinguishes between known jailbreak types and assesses the detection capabilities for novel, unseen jailbreak strategies.
Feature analysis reveals that explicit reflexivity in prompt structure—especially references to corporate policies—could be a key indicator of jailbreak intent.
Data augmentation techniques, such as back translation and synonym replacement, were employed to enhance the robustness of the model against varied jailbreak prompts.
The research contributes to the understanding of linguistic elements that characterize jailbreaks, potentially guiding future defenses.

💡 Why This Paper Matters

This paper is crucial as it addresses a pressing vulnerability in LLMs: the ability of malicious users to manipulate input prompts to generate undesirable outputs. By developing robust detection methods and providing insights into prompt characteristics, this research enhances the security measures for LLM deployment in practical applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper valuable because it tackles the emerging threats posed by jailbreaks, which could lead to unchecked model behavior. The proposed BERT-based detection methodology, along with insights into prompt linguistics, offers practical tools and frameworks for mitigating risks in AI applications.

NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper