Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Authors: Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

Published: 2026-01-15

arXiv ID: 2601.10543v1

Added to Library: 2026-01-16 03:01 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often failing to provide robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model's drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over-refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety-awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.

🔍 Key Points

  • The paper introduces SafeProbing, a novel in-decoding probing technique that uses latent safety-awareness signals from large language models (LLMs) to enhance defense against jailbreak attacks.
  • Experiments demonstrate that SafeProbing significantly increases defense success rates against various sophisticated jailbreak attacks while maintaining low over-refusal rates on benign inputs and preserving response quality.
  • SafeProbing leverages latent safety signals present during text generation, allowing real-time intervention rather than post-generation checks and thus potentially more timely and effective responses to harmful content (a minimal illustrative sketch follows this list).
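
To make the general idea concrete, here is a minimal sketch of what in-decoding safety probing could look like with a Hugging Face causal LM. This is not the paper's released implementation (see the linked SafeProbing repository for that): the `safety_probe` linear classifier, the `UNSAFE_THRESHOLD`, the probing interval, and the refusal string are all hypothetical placeholders, and the probe would need to be trained on labeled hidden states before it produced meaningful scores.

```python
# Hedged sketch of in-decoding safety probing; assumptions are marked in comments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # example model; any HF causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Placeholder probe: a linear head mapping the last hidden state to P(unsafe).
# In practice this would be trained offline on hidden states from safe/unsafe generations.
safety_probe = torch.nn.Linear(model.config.hidden_size, 2)

REFUSAL = "I can't help with that."
UNSAFE_THRESHOLD = 0.8   # hypothetical; would be tuned on a validation set
PROBE_EVERY = 8          # probe every k generated tokens to limit overhead


@torch.no_grad()
def generate_with_probing(prompt: str, max_new_tokens: int = 256) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    for step in range(max_new_tokens):
        # Re-run the full prefix each step for clarity; a real implementation
        # would reuse the KV cache instead.
        out = model(generated, output_hidden_states=True)

        # Greedy next-token choice; sampling would work the same way.
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

        if step % PROBE_EVERY == 0:
            # Probe the last layer's hidden state at the current position for a
            # latent "unsafe continuation" signal and refuse early if it fires.
            hidden = out.hidden_states[-1][:, -1, :].float()
            p_unsafe = torch.softmax(safety_probe(hidden), dim=-1)[0, 1].item()
            if p_unsafe > UNSAFE_THRESHOLD:
                return REFUSAL

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated[0, input_ids.shape[1]:], skip_special_tokens=True)
```

The design point this sketch illustrates is the one the key points emphasize: the safety check runs inside the decoding loop on the model's own hidden states, so generation can be cut off as soon as a latent unsafe signal appears rather than after a full response has been produced.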

💡 Why This Paper Matters

The relevance of this paper lies in its innovative approach to defending LLMs against jailbreak attacks by utilizing the models' intrinsic safety awareness during the decoding process. This method demonstrates both a significant improvement in safety and a mitigation of the utility loss typically associated with defensive measures, making it a vital contribution to the ongoing research in AI safety and security.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of great interest to AI security researchers as it addresses a pressing issue in the deployment of language models: their vulnerability to jailbreak attacks. By presenting a method that effectively enhances model safety without sacrificing user experience, it opens new avenues for research into resilient, safe AI systems capable of functioning in real-world applications. Additionally, the findings provide insights that could inform the design of future defenses in AI systems, making it a critical contribution to the field.

📚 Read the Full Paper