
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Authors: Chongwen Zhao, Kaizhu Huang

Published: 2025-09-01

arXiv ID: 2509.01631v1

Added to Library: 2025-09-04 04:00 UTC

Red Teaming Safety

πŸ“„ Abstract

Large Language Models (LLMs) are attracting increasing attention across a wide range of applications. At the same time, there is growing concern that some users try to exploit these models for malicious purposes, such as synthesizing controlled substances or spreading disinformation, by bypassing their safety alignment, a practice known as "jailbreaking." While some studies have defended against jailbreak attacks by modifying output distributions or detecting harmful content, the underlying mechanism of these attacks remains elusive. In this work, we present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. Unlike existing approaches, our method projects the model's internal representations into a more consistent and interpretable vocabulary space. We then show that adjusting the activation of safety-related neurons can effectively control the model's behavior, with a mean attack success rate (ASR) higher than 97%. Building on this insight, we propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness against jailbreaks. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms all four baseline defenses. These findings offer a new perspective on understanding and defending against jailbreak attacks.
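
To make the abstract's two core ideas concrete, the minimal sketch below (not the paper's code) shows how one might scale the activations of a few MLP neurons with a forward pre-hook and read an intermediate hidden state in vocabulary space by projecting it through the final norm and unembedding matrix, assuming a Hugging Face Llama-style causal LM. The model name, layer index, neuron indices, and scaling factor are hypothetical placeholders, not values from the paper.

```python
# Minimal illustrative sketch (assumptions: Llama-style module paths in transformers;
# LAYER, NEURONS, and SCALE are hypothetical, not taken from the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"   # placeholder; any Llama-style chat model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
model.eval()

LAYER = 20                      # hypothetical layer to intervene on
NEURONS = [1093, 2471, 3055]    # hypothetical "safety neuron" indices in the MLP
SCALE = 0.0                     # 0.0 suppresses the neurons, >1.0 amplifies them

def scale_neurons(module, args):
    # args[0]: intermediate activations entering down_proj, shape (batch, seq, d_ff)
    hidden = args[0].clone()
    hidden[..., NEURONS] *= SCALE
    return (hidden,)

# Intercept the MLP's intermediate activations just before the output projection.
handle = model.model.layers[LAYER].mlp.down_proj.register_forward_pre_hook(scale_neurons)

prompt = "Explain why a model might refuse an unsafe request."
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit-lens-style projection: interpret the hidden state after LAYER in vocab space.
hidden = out.hidden_states[LAYER + 1][:, -1, :]
vocab_logits = model.lm_head(model.model.norm(hidden))
print(tok.convert_ids_to_tokens(vocab_logits.topk(5).indices[0].tolist()))

handle.remove()
```

In this sketch, setting SCALE to 0.0 simply zeroes out the chosen neurons' contribution, while values above 1.0 amplify it; which intervention actually flips refusals into compliance depends on identifying the right neurons, which is the part the paper's method addresses.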

πŸ” Key Points

  • Introduction of a novel neuron-level interpretability method for examining safety-related knowledge neurons in Large Language Models (LLMs).
  • Demonstration that adjusting the activation of safety-related neurons can effectively control model behavior, reaching a mean attack success rate (ASR) above 97%.
  • Proposal of SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons, improving LLM robustness against jailbreaks and consistently outperforming the four baseline defenses evaluated (see the sketch after this list for a rough illustration of neuron-targeted tuning).
  • Empirical evidence shows that safety knowledge neurons can be calibrated to shift LLM responses from rejecting harmful prompts to complying with them, highlighting the vulnerabilities of aligned LLMs.
  • Comprehensive analysis and comparison of attack methods and defense mechanisms against jailbreaks, providing a clearer understanding of model behavior.
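
The SafeTuning bullet above describes reinforcing safety-critical neurons through fine-tuning. As a rough, hypothetical illustration of what neuron-targeted tuning can look like in practice, the sketch below freezes the model and masks gradients so that only the MLP weight rows and columns attached to a chosen set of neuron indices are updated on refusal-style data. The neuron indices, training text, and hyperparameters are assumptions for illustration, not the paper's SafeTuning recipe.

```python
# Hypothetical sketch of neuron-targeted fine-tuning via gradient masking.
# NOT the paper's SafeTuning recipe: indices, data, and hyperparameters are made up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"   # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

LAYER, NEURONS = 20, [1093, 2471, 3055]   # hypothetical safety neurons
mlp = model.model.layers[LAYER].mlp

# Freeze everything, then re-enable only the MLP matrices touching those neurons.
for p in model.parameters():
    p.requires_grad_(False)
for w in (mlp.gate_proj.weight, mlp.up_proj.weight, mlp.down_proj.weight):
    w.requires_grad_(True)

def mask_grads():
    # gate_proj/up_proj weights: (d_ff, d_model) -> keep only selected output rows.
    # down_proj weight:          (d_model, d_ff) -> keep only selected input columns.
    keep = torch.zeros(mlp.gate_proj.weight.shape[0], dtype=torch.bool)
    keep[NEURONS] = True
    mlp.gate_proj.weight.grad[~keep, :] = 0
    mlp.up_proj.weight.grad[~keep, :] = 0
    mlp.down_proj.weight.grad[:, ~keep] = 0

trainable = [p for p in model.parameters() if p.requires_grad]
# weight_decay=0 so entries outside the selected neurons are not shrunk by AdamW.
optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=0.0)

# One refusal-style example; a real run would loop over a safety dataset.
text = "USER: <harmful request> ASSISTANT: I can't help with that."
batch = tok(text, return_tensors="pt")

model.train()
out = model(**batch, labels=batch["input_ids"].clone())
out.loss.backward()
mask_grads()        # zero gradients everywhere except the selected neurons' weights
optimizer.step()
optimizer.zero_grad()
```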

πŸ’‘ Why This Paper Matters

This paper presents significant advances in understanding and defending against jailbreak attacks on Large Language Models (LLMs) by introducing a novel neuron-level interpretability method and a targeted fine-tuning strategy. The findings not only elucidate the crucial role of safety knowledge neurons in the decision-making processes of LLMs but also offer practical techniques that enhance model resilience against malicious exploitation. As AI systems become ubiquitous, addressing these vulnerabilities is vital for the ethical deployment of LLMs.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it directly tackles the challenges posed by jailbreak attacksβ€”an emerging threat in the deployment of LLMs. By providing novel insights into neuron-level interpretability and defense mechanisms, the research contributes to the broader understanding of model vulnerabilities and mitigation strategies. Furthermore, the practical implications of the proposed methods could pave the way for developing more secure AI systems, making it a valuable resource for those focused on enhancing AI safety and robustness.

πŸ“š Read the Full Paper