Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Authors: Jinman Wu, Yi Xie, Shiqian Zhao, Xiaofeng Chen

Published: 2026-03-06

arXiv ID: 2603.05772v1

Added to Library: 2026-03-09 02:02 UTC

Red Teaming

📄 Abstract

Currently, open-sourced large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their structures and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, which creates a false sense of security when defenses appear to succeed. In this paper, we propose the Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that exploits vulnerabilities in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers introduce more vulnerability to jailbreak attacks. Based on this finding, SAHA introduces an Ablation-Impact Ranking head-selection strategy to effectively locate the layers most responsible for unsafe output. Second, we introduce a boundary-aware perturbation method, Layer-Wise Perturbation, to probe the generation of unsafe content with minimal perturbation to the attention. This constrained perturbation preserves semantic relevance to the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves ASR by 14% over SOTA baselines, revealing the vulnerability of the attention head as an attack surface. Our code is available at https://anonymous.4open.science/r/SAHA.

🔍 Key Points

  • The paper introduces the Safety Attention Head Attack (SAHA), which targets vulnerabilities in the deeper attention heads of large language models (LLMs), revealing new avenues for jailbreak attacks that existing methods overlook.
  • The authors develop the Ablation-Impact Ranking (AIR) strategy that identifies critical attention heads related to model safety by measuring the performance drop when individual heads are ablated.
  • Layer-Wise Perturbation (LWP) is introduced to efficiently distribute perturbations across selected heads while minimizing semantic distortion, enabling effective evasion of the safety mechanisms of LLMs.
  • Experimental results show that SAHA improves the attack success rate (ASR) by 14% over existing state-of-the-art baselines, indicating a significant advancement in the efficiency and effectiveness of jailbreak techniques.
  • The work highlights the urgent need for enhanced safety mechanisms in LLMs, suggesting that existing defenses are inadequate against deeper-layer vulnerabilities.
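The Ablation-Impact Ranking idea described above can be sketched in a few lines: ablate each attention head in turn, measure how much a safety metric drops relative to the unablated baseline, and rank heads by that drop. The sketch below is a minimal illustration under stated assumptions; `safety_score` is a hypothetical stand-in for a real evaluation (e.g. refusal rate on harmful prompts), and the paper's actual metric, model internals, and head-selection details are not reproduced here.

```python
from typing import Callable, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)

def ablation_impact_ranking(
    heads: List[Head],
    safety_score: Callable[[FrozenSet[Head]], float],
) -> List[Tuple[Head, float]]:
    """Rank heads by the safety-score drop caused by ablating each one.

    A larger drop means the head contributes more to the model's safety
    behavior, making it a candidate target per the AIR strategy.
    """
    baseline = safety_score(frozenset())            # score with no heads ablated
    impacts = []
    for h in heads:
        drop = baseline - safety_score(frozenset({h}))
        impacts.append((h, drop))
    # Heads whose removal hurts safety most come first
    return sorted(impacts, key=lambda x: x[1], reverse=True)

if __name__ == "__main__":
    # Synthetic scorer: pretend deeper layers carry more of the safety role,
    # echoing the paper's finding that deeper layers are more vulnerable.
    heads = [(layer, head) for layer in range(4) for head in range(2)]
    weights = {h: 0.1 * h[0] for h in heads}
    scorer = lambda ablated: 1.0 - sum(weights[h] for h in ablated)
    ranking = ablation_impact_ranking(heads, scorer)
    print(ranking[0][0])  # most safety-critical head in this toy setup
```

In a real attack setting, `safety_score` would require running the model with the chosen head's attention output zeroed (e.g. via forward hooks) and evaluating refusal behavior, which is far more expensive than this toy scorer suggests.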

💡 Why This Paper Matters

This paper is crucial for understanding the security landscape surrounding open-sourced large language models, as it uncovers vulnerabilities at a deeper architectural level that have not been adequately addressed in prior research. By exposing these weaknesses through novel methods, it provides valuable insights for developing more robust defense strategies against jailbreak attacks.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper significant because it advances the understanding of vulnerabilities within LLMs, illustrating how traditional safety measures can be bypassed. The innovative methodologies presented, particularly AIR and LWP, offer new tools for assessing and improving model safety, making it a vital resource for those working in AI safety and security.
