
Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Authors: Fengheng Chu, Jiahao Chen, Yuhong Wang, Jun Wang, Zhihui Fu, Shouling Ji, Songze Li

Published: 2026-01-22

arXiv ID: 2601.15801v1

Added to Library: 2026-01-23 03:00 UTC

Red Teaming

📄 Abstract

While Large Language Models (LLMs) are aligned to mitigate risks, their safety guardrails remain fragile against jailbreak attacks. This fragility reveals a limited understanding of the components governing safety. Existing methods rely on local, greedy attribution that assumes independent component contributions, overlooking the cooperative interactions between components in LLMs, such as attention heads, which jointly contribute to safety mechanisms. We propose Global Optimization for Safety Vector Extraction (GOSV), a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We employ two complementary activation repatching strategies: Harmful Patching and Zero Ablation. These strategies identify two spatially distinct sets of safety vectors with consistently low overlap, termed Malicious Injection Vectors and Safety Suppression Vectors, demonstrating that aligned LLMs maintain separate functional pathways for safety purposes. Through systematic analyses, we find that complete safety breakdown occurs when approximately 30% of total heads are repatched across all models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white-box attacks across all test models, providing strong evidence for the effectiveness of the proposed GOSV framework for LLM safety interpretability.
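To make the repatching strategies concrete, here is a minimal sketch of how Harmful Patching and Zero Ablation could be applied to individual attention heads at inference time. It assumes a Llama-style architecture in which per-head outputs are concatenated before the attention output projection (o_proj); the head indices, layer choice, and cached activations are illustrative placeholders, not the safety vectors identified by GOSV.

```python
# Hedged sketch of per-head activation repatching (Harmful Patching vs.
# Zero Ablation), assuming a Llama-style model where per-head attention
# outputs are concatenated before the o_proj linear layer. Head indices
# and cached activations below are placeholders, not GOSV's selected heads.
import torch


def make_repatch_hook(head_indices, head_dim, harmful_acts=None):
    """Build a forward pre-hook for an attention o_proj module.

    harmful_acts: optional tensor of shape (batch, seq, num_heads * head_dim)
    cached from a run on a harmful prompt (Harmful Patching). If None, the
    selected heads are zeroed instead (Zero Ablation). For simplicity the
    cached sequence length is assumed to match the current one.
    """
    def hook(module, args):
        hidden = args[0].clone()  # (batch, seq, num_heads * head_dim)
        for h in head_indices:
            sl = slice(h * head_dim, (h + 1) * head_dim)
            if harmful_acts is None:
                hidden[..., sl] = 0.0                    # Zero Ablation
            else:
                hidden[..., sl] = harmful_acts[..., sl]  # Harmful Patching
        return (hidden,) + args[1:]
    return hook


# Hypothetical usage: zero-ablate heads 3 and 7 in layer 12 of a loaded
# Hugging Face causal LM (attribute path assumes a Llama-style model).
# attn = model.model.layers[12].self_attn
# handle = attn.o_proj.register_forward_pre_hook(
#     make_repatch_hook([3, 7], head_dim=attn.head_dim))
# ...generate as usual...
# handle.remove()
```

Hooking the input of o_proj gives per-head granularity without modifying model code, since each head's output occupies a contiguous head_dim slice of that tensor.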

🔍 Key Points

  • Introduction of the Global Optimization for Safety Vector Extraction (GOSV) framework, which identifies safety-critical attention heads in LLMs by optimizing over all heads jointly, addressing the limitations of local, greedy attribution methods (an illustrative sketch follows this list).
  • Discovery of two sets of safety vectors—Malicious Injection Vectors and Safety Suppression Vectors—indicating that aligned LLMs have distinct pathways for safety mechanisms.
  • Identification of a breakdown threshold: complete loss of safety behavior occurs once approximately 30% of all attention heads are manipulated, offering insight into the structural vulnerabilities of LLMs.
  • Development of a novel inference-time white-box jailbreak method that significantly outperforms existing methods, demonstrating the practical implications of the findings on LLM safety.
  • Systematic empirical evaluation across multiple models showing consistent patterns and effectiveness of the GOSV framework in enhancing LLM safety interpretability.
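To illustrate what a global search over all heads (rather than greedy per-head scoring) can look like, the sketch below jointly optimizes a relaxed 0-1 mask over every (layer, head) pair with a sparsity penalty. The objective, model shape, and hyperparameters are assumptions for illustration; the summary does not specify GOSV's exact formulation.

```python
# Illustrative sketch of jointly optimizing a head-selection mask over all
# attention heads at once, in contrast to greedy, one-head-at-a-time
# attribution. attack_loss is a stand-in: in practice it would run the model
# with the masked heads repatched (e.g., via the hooks sketched above) and
# score how far the output is from a refusal. Shapes and hyperparameters
# are assumptions, not values from the paper.
import torch

num_layers, num_heads = 32, 32                    # assumed model shape
mask_logits = torch.zeros(num_layers, num_heads, requires_grad=True)
optimizer = torch.optim.Adam([mask_logits], lr=0.05)
sparsity_weight = 1e-4                            # keep the selected set small


def attack_loss(mask):
    # Placeholder objective: stands in for "how much refusal behavior remains
    # when heads weighted by `mask` are repatched". Not the paper's loss.
    return 1.0 - mask.mean()


for step in range(200):
    mask = torch.sigmoid(mask_logits)             # relaxed mask in (0, 1)
    loss = attack_loss(mask) + sparsity_weight * mask.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Heads with the largest mask values are treated as safety-critical; the
# paper reports that repatching roughly 30% of all heads is enough to
# collapse safety behavior across the tested models.
selected_heads = (torch.sigmoid(mask_logits) > 0.5).nonzero(as_tuple=False)
```

A relaxed mask is only one way to realize such a joint search; the point of the sketch is that all heads are scored under a shared objective, so cooperative effects between heads can influence the selection, rather than each head being ranked independently.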

💡 Why This Paper Matters

This paper provides crucial insights into the vulnerabilities and safety mechanisms of Large Language Models (LLMs) by introducing a novel framework for identifying critical safety components. The findings highlight the importance of understanding the interdependencies within LLM architectures, which is essential for developing more robust safety measures. By showing that safety vulnerabilities can be exploited even in models aligned with standard techniques, the research underscores the need for more sophisticated safety mechanisms in AI systems. The jailbreak method developed here offers a practical demonstration of these vulnerabilities and advances the pursuit of safer, more interpretable AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it addresses pressing concerns about the safety and integrity of Large Language Models (LLMs). The exploration of systematic vulnerabilities and the development of effective exploitation techniques provide a framework for understanding potential attack vectors on LLMs. Moreover, the paper contributes to ongoing discussions about AI alignment and safety, emphasizing the importance of considering both the structural and operational aspects of LLMs in security research. As models become more integrated into critical applications, understanding these vulnerabilities is vital for crafting effective defenses and improving AI safety measures.

📚 Read the Full Paper