LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

Authors: Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio

Published: 2025-10-07

arXiv ID: 2510.08604v1

Added to Library: 2025-10-13 12:02 UTC

Red Teaming

📄 Abstract

Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
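As an illustration of the kind of defense the abstract refers to, below is a minimal sketch of a perplexity-based input filter, assuming a GPT-2 scoring model from Hugging Face transformers and an arbitrary threshold; the paper's actual filter configuration may differ.

```python
# Minimal sketch of a perplexity-based input filter (illustrative, not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed scoring model, chosen here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def prompt_perplexity(prompt: str) -> float:
    """Return the perplexity of a prompt under the scoring model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to input_ids, the model returns the mean cross-entropy loss.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

def is_flagged(prompt: str, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity exceeds the threshold (threshold value is arbitrary)."""
    return prompt_perplexity(prompt) > threshold
```

High-perplexity adversarial suffixes tend to be caught by such a check, whereas low-perplexity prompts like those produced by LatentBreak pass through it.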

🔍 Key Points

  • Introduction of LatentBreak, a novel white-box jailbreak attack for large language models that maintains low perplexity through word substitutions.
  • LatentBreak uses latent-space optimization to select semantically equivalent word replacements, reducing the detectability of jailbreaks by existing filters (see the sketch after this list).
  • Extensive evaluations show that LatentBreak outperforms state-of-the-art jailbreak techniques in attack success rate while evading perplexity-based defense mechanisms.
  • The paper demonstrates the impact of prompt length and structure on the effectiveness and detectability of jailbreak attacks, favoring shorter, more coherent prompts.
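To make the selection criterion concrete, here is a minimal sketch of latent-space-guided word substitution in the spirit of LatentBreak, not the authors' exact algorithm: it assumes white-box access to a causal LM, uses the mean last-layer hidden state as the prompt representation, and takes candidate replacement words as given. The model name, the harmless prompts, and the helper best_substitution are illustrative assumptions.

```python
# Sketch: pick the word substitution that moves the prompt's latent representation
# closest to that of harmless requests (illustrative, not the paper's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def latent(prompt: str) -> torch.Tensor:
    """Mean last-layer hidden state of the prompt (one possible choice of latent representation)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1]  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Reference point: average latent representation of a small set of harmless prompts.
harmless_prompts = ["How do I bake a loaf of bread?", "Summarize the plot of Hamlet."]
harmless_center = torch.stack([latent(p) for p in harmless_prompts]).mean(dim=0)

def best_substitution(prompt: str, position: int, candidates: list[str]) -> str:
    """Try each candidate word at `position`; keep the prompt whose representation
    is closest to the harmless center (keeping the original if no candidate improves it)."""
    words = prompt.split()
    best_prompt = prompt
    best_dist = torch.norm(latent(prompt) - harmless_center).item()
    for cand in candidates:
        trial_words = words.copy()
        trial_words[position] = cand
        trial = " ".join(trial_words)
        dist = torch.norm(latent(trial) - harmless_center).item()
        if dist < best_dist:
            best_prompt, best_dist = trial, dist
    return best_prompt
```

Because each substitution replaces a word with a semantically equivalent one rather than appending an adversarial suffix, the resulting prompt stays short and natural, which is what keeps its perplexity low.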

💡 Why This Paper Matters

The paper presents a notable advancement in the field of AI safety, addressing the critical issue of jailbreak attacks on large language models. By developing LatentBreak, which searches the latent space for effective word-level prompt modifications, the authors show that perplexity-based input filters are not sufficient defenses against natural, low-perplexity adversarial prompts. This research underscores the need for ongoing development of adversarial defenses and has direct implications for practical AI safety applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly significant for AI security researchers because it shows that large language models remain vulnerable to adversarial prompts that bypass built-in safety mechanisms without triggering perplexity-based filters. LatentBreak motivates future research on defenses that go beyond input perplexity, and understanding the mechanics behind such jailbreaks can guide the development of more robust and adaptive AI systems that prioritize user safety while preserving their intended functionality.

📚 Read the Full Paper