
SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

Authors: Maithili Joshi, Palash Nandi, Tanmoy Chakraborty

Published: 2025-09-19

arXiv ID: 2509.16060v1

Added to Library: 2025-09-22 04:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) with safety alignment training are powerful instruments with robust language comprehension capabilities. These models typically undergo meticulous alignment procedures involving human feedback to ensure that they accept safe inputs while rejecting harmful or unsafe ones. However, despite their massive scale and alignment efforts, LLMs remain vulnerable to jailbreak attacks, in which malicious users manipulate the model into producing harmful outputs that it was explicitly trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly embedded in the middle-to-late layers. Building on this insight, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), which connects two intermediate layers $s$ and $e$, with $s < e$, through a residual connection. Our approach achieves a 51% improvement over the best-performing baseline on the HarmBench test set. Furthermore, SABER induces only a marginal shift in perplexity when evaluated on the HarmBench validation set. The source code is publicly available at https://github.com/PalGitts/SABER.
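As a rough illustration of the idea described in the abstract, the sketch below wires an extra residual connection from the output of an earlier decoder layer $s$ into the input of a later layer $e$ using PyTorch forward hooks on a Hugging Face causal LM. This is not the authors' implementation: the model name, layer indices, and scaling factor `alpha` are illustrative placeholders, and SABER's actual procedure for choosing $s$ and $e$ is described in the paper and repository.

```python
# Minimal sketch (assumptions: a Llama-style decoder exposing model.model.layers,
# and placeholder values for s, e, and alpha). Not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any decoder-only LM works in principle
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layers = model.model.layers   # decoder blocks (Llama-style attribute path)
s, e, alpha = 10, 24, 1.0     # hypothetical start/end layers and residual scaling

_cache = {}

def save_hidden(module, inputs, output):
    # Decoder layers may return a tuple; the first element is the hidden state.
    _cache["h_s"] = output[0] if isinstance(output, tuple) else output

def add_residual(module, args, kwargs):
    # Add the cached hidden state from layer s to layer e's input hidden states.
    hidden = args[0] if args else kwargs["hidden_states"]
    hidden = hidden + alpha * _cache["h_s"]
    if args:
        return (hidden,) + args[1:], kwargs
    kwargs["hidden_states"] = hidden
    return args, kwargs

h1 = layers[s].register_forward_hook(save_hidden)
h2 = layers[e].register_forward_pre_hook(add_residual, with_kwargs=True)

prompt = "Explain how residual connections work."
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

h1.remove()  # detach the hooks to restore the original forward pass
h2.remove()
```

Because the extra residual is injected at inference time via hooks, no weights are modified and the connection can be removed by simply detaching the hooks.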

🔍 Key Points

  • Introduces SABER (Safety Alignment Bypass via Extra Residuals), a white-box jailbreak method that adds a residual connection between two intermediate layers of an LLM to subvert its safety mechanisms
  • Achieves a 51% improvement over the best-performing baseline on the HarmBench test set, demonstrating markedly stronger jailbreak effectiveness
  • Shows that the safety mechanisms of aligned LLMs predominantly reside in the middle-to-late layers of the architecture
  • Provides a systematic procedure for identifying the relevant layer boundaries and selecting the residual connection, preserving language capability with only a marginal shift in perplexity
  • Releases the source code publicly, facilitating further research on LLM security

💡 Why This Paper Matters

The paper presents SABER as a critical advancement in understanding and exploiting vulnerabilities in safety-aligned large language models. By showing that safety measures can be bypassed without additional training or model reconfiguration, the work provides valuable insight into the weaknesses of existing alignment frameworks and highlights the need for stronger defenses in widely deployed AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it investigates a significant attack vector against LLMs, which are increasingly deployed in sensitive tasks. Understanding how such vulnerabilities can be exploited helps inform the design of more robust models. Moreover, the findings have practical implications for strengthening defenses against jailbreaking, a growing concern in the AI community.

📚 Read the Full Paper