
NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks

Authors: Lama Sleem, Jerome Francois, Lujun Li, Nathan Foucher, Niccolo Gentile, Radu State

Published: 2025-11-14

arXiv ID: 2511.11784v1

Added to Library: 2025-11-18 03:02 UTC

Red Teaming

📄 Abstract

Jailbreak attacks designed to bypass safety mechanisms pose a serious threat: they prompt LLMs to generate harmful or inappropriate content despite the models' alignment with ethical guidelines. Crafting universal filtering rules remains difficult because such rules inherently depend on specific contexts. To address these challenges without relying on threshold calibration or model fine-tuning, this work introduces a semantic consistency analysis between successful and unsuccessful responses, demonstrating that a negation-aware scoring approach captures meaningful patterns. Building on this insight, a novel detection framework called NegBLEURT Forest is proposed to evaluate the degree of alignment between outputs elicited by adversarial prompts and expected safe behaviors. It identifies anomalous responses using the Isolation Forest algorithm, enabling reliable jailbreak detection. Experimental results show that the proposed method consistently achieves top-tier performance, ranking first or second in accuracy across diverse models on the crafted dataset, while competing approaches exhibit notable sensitivity to model and data variations.
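
As a concrete illustration of the scoring step the abstract describes, the minimal sketch below computes negation-aware similarity scores between a model response and a set of reference refusals. It assumes the publicly available tum-nlp/NegBLEURT checkpoint on Hugging Face and the standard transformers sequence-classification interface; the reference refusals and preprocessing are illustrative placeholders, not the paper's exact setup.

```python
# Minimal sketch: negation-aware similarity scoring with NegBLEURT.
# Assumes the tum-nlp/NegBLEURT checkpoint and the standard transformers
# sequence-classification interface; reference responses are hypothetical.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "tum-nlp/NegBLEURT"  # negation-aware BLEURT variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def negbleurt_scores(references: list[str], candidate: str) -> torch.Tensor:
    """Score a candidate response against each reference (refusal) response."""
    inputs = tokenizer(
        references,
        [candidate] * len(references),
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        # The regression head emits one similarity score per (ref, cand) pair.
        return model(**inputs).logits.squeeze(-1)

# Hypothetical reference refusals and a response to score:
refusals = [
    "I'm sorry, but I can't help with that request.",
    "I cannot provide instructions for harmful activities.",
]
response = "Sure! Here is a step-by-step guide..."
print(negbleurt_scores(refusals, response))  # low scores = diverges from refusals
```

A response that diverges sharply from every reference refusal yields a distinctive low-score vector, which is what the anomaly-detection stage exploits.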

🔍 Key Points

  • Introduces NegBLEURT Forest, a novel framework for detecting jailbreak attacks in LLM outputs through semantic consistency analysis.
  • Compares successful and unsuccessful jailbreak responses semantically using a negation-aware scoring method, which outperforms traditional cosine-similarity measures.
  • Applies the Isolation Forest algorithm for anomalous response detection, enabling reliable identification of harmful prompt outputs without predefined thresholds (see the sketch after this list).
  • Demonstrates superior performance across varied experimental settings, achieving high accuracy and robustness compared with existing detection methods.
  • Highlights the importance of analyzing model refusal behaviors, emphasizing that variability in responses necessitates more flexible and adaptive detection frameworks.
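
To make the threshold-free anomaly-detection step concrete, the sketch below fits scikit-learn's IsolationForest on vectors of per-reference NegBLEURT scores and flags outlying responses. The feature construction, the synthetic score distributions, and the hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: threshold-free anomaly detection over NegBLEURT score vectors.
# Feature construction and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Each row holds one response's NegBLEURT scores against K reference refusals.
# Safe responses score high against refusals; jailbroken ones drift low.
safe_like = rng.normal(loc=0.8, scale=0.05, size=(200, 4))
jailbroken_like = rng.normal(loc=0.2, scale=0.10, size=(10, 4))
X = np.vstack([safe_like, jailbroken_like])

forest = IsolationForest(n_estimators=200, random_state=0)
forest.fit(X)

labels = forest.predict(X)            # +1 = inlier (safe), -1 = suspected jailbreak
scores = forest.decision_function(X)  # lower = more anomalous
print("flagged as anomalous:", int((labels == -1).sum()))
```

Because Isolation Forest isolates outliers by how few random splits they require, it needs no calibrated score cutoff, which matches the paper's stated goal of avoiding threshold calibration.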

💡 Why This Paper Matters

The paper presents a significant advance in detecting jailbreak attacks on large language models by proposing the NegBLEURT Forest framework, which leverages semantic consistency to improve detection reliability without relying on rigid, predefined thresholds. This approach strengthens security measures while adapting to the variability of model behaviors that safety mechanisms must contend with.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant because it addresses a critical and evolving challenge: jailbreak attacks that exploit vulnerabilities in large language models. The proposed method offers a data-driven approach to strengthening model safety, informing more effective security protocols and deployment strategies and contributing to the broader discourse on responsible AI use.

📚 Read the Full Paper