
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Authors: Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen

Published: 2026-03-06

arXiv ID: 2603.05773v1

Added to Library: 2026-03-09 02:01 UTC

Red Teaming

📄 Abstract

Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis (v_H, "Knowing") and an Execution Axis (v_R, "Acting"). Our geometric analysis reveals a universal "Reflex-to-Dissociation" evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of "Knowing without Acting." Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.

🔍 Key Points

  • Introduction of the Disentangled Safety Hypothesis (DSH), proposing that safety mechanisms in large language models operate on two distinct axes: Recognition (Knowing) and Execution (Acting).
  • Evidence for a 'Reflex-to-Dissociation' trajectory: recognition and execution are entangled in early layers but become structurally independent in deeper layers, and this independence is what jailbreak attacks exploit.
  • Development of new methodologies, including Double-Difference Extraction and Adaptive Causal Steering, to effectively separate the Recognition and Execution axes, facilitating more precise control over AI behavior.
  • Validation of the proposed Refusal Erasure Attack (REA), which achieves state-of-the-art success rates in bypassing the safety mechanisms of large language models by removing the refusal capability while preserving the model's recognition of harmfulness.
  • Architectural divergence between models: Llama3.1 exhibits Explicit Semantic Control while Qwen2.5 relies on Latent Distributed Control, illustrating the diversity of safety implementations and the need for tailored safety architectures.
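The extraction-and-ablation idea behind these methods can be sketched with a toy example. The snippet below is a minimal illustration, not the authors' implementation: it assumes (as is common in activation-steering work) that each axis is estimated as a difference of mean hidden states between contrasting prompt sets, that "double difference" amounts to removing the component shared with the other axis, and that REA-style erasure is a projection that deletes the Execution axis while leaving the Recognition axis intact. The four mean activations here are random stand-ins; in practice they would come from a model's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# Hypothetical mean hidden states for four contrasting prompt/response sets.
h_harmful = rng.normal(size=d)    # harmful prompts
h_benign = rng.normal(size=d)     # benign prompts
h_refused = rng.normal(size=d)    # responses that refuse
h_complied = rng.normal(size=d)   # responses that comply

def unit(v):
    return v / np.linalg.norm(v)

# Difference-of-means directions for the two hypothesized axes.
v_H = unit(h_harmful - h_benign)      # Recognition axis: "knowing"
v_R = unit(h_refused - h_complied)    # Execution axis: "acting"

# "Double difference": remove the component of v_R shared with v_H,
# so the two axes are disentangled (orthogonal).
v_R = unit(v_R - (v_R @ v_H) * v_H)

def erase_refusal(h):
    """REA-style ablation: project out the Execution axis only."""
    return h - (h @ v_R) * v_R

h = rng.normal(size=d)            # some activation to steer
h_ablated = erase_refusal(h)

# The Execution component is gone, but the Recognition component
# survives: "Knowing without Acting".
print(abs(h_ablated @ v_R) < 1e-9)
print(np.isclose(h_ablated @ v_H, h @ v_H))
```

Because v_R is orthogonalized against v_H before ablation, deleting the refusal direction provably leaves the projection onto v_H unchanged, which is the geometric content of the causal double dissociation described above.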

💡 Why This Paper Matters

The paper makes significant strides in understanding vulnerabilities in the safety mechanisms of large language models. By dissecting the relationship between recognizing harmfulness and executing refusal, it explains how jailbreak vulnerabilities arise and suggests more robust, mechanistically grounded approaches to AI safety that could inform future model development.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant for three reasons: it presents concrete methodologies for analyzing and manipulating the safety mechanisms of large language models, it explains a persistent class of vulnerabilities (jailbreaks) at the mechanistic level, and it pairs that theoretical understanding with practical attacks and defenses that can guide the design of more secure AI systems.
