
Efficient Refusal Ablation in LLM through Optimal Transport

Authors: Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob

Published: 2026-03-04

arXiv ID: 2603.04355v1

Added to Library: 2026-03-05 04:00 UTC

Red Teaming

📄 Abstract

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
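The core computation described in the abstract, fitting Gaussians to harmful and harmless activations in a shared PCA subspace and mapping one onto the other with the closed-form Gaussian (Monge) optimal transport map, can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the function name, the pooled-PCA basis, the covariance regularization, and the residual-preserving reconstruction are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def gaussian_ot_map(X_src, X_tgt, n_components=32):
    """Sketch of PCA + closed-form Gaussian OT.

    Fits Gaussians to two activation sets in a shared PCA subspace,
    then maps source activations onto the target distribution via the
    Monge map T(z) = mu_t + A (z - mu_s), where
    A = S_s^{-1/2} (S_s^{1/2} S_t S_s^{1/2})^{1/2} S_s^{-1/2}.
    """
    # Shared PCA basis from the pooled activations
    pooled = np.vstack([X_src, X_tgt])
    mean = pooled.mean(0)
    _, _, Vt = np.linalg.svd(pooled - mean, full_matrices=False)
    P = Vt[:n_components].T                      # (d, k) projection

    Zs, Zt = (X_src - mean) @ P, (X_tgt - mean) @ P
    mu_s, mu_t = Zs.mean(0), Zt.mean(0)
    eps = 1e-6                                   # jitter for stability
    Sig_s = np.cov(Zs, rowvar=False) + eps * np.eye(n_components)
    Sig_t = np.cov(Zt, rowvar=False) + eps * np.eye(n_components)

    def sqrtm(S):
        # Symmetric PSD matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

    Rs = sqrtm(Sig_s)
    Rs_inv = np.linalg.inv(Rs)
    A = Rs_inv @ sqrtm(Rs @ Sig_t @ Rs) @ Rs_inv  # Monge map matrix

    def transport(X):
        Z = (X - mean) @ P
        Z_new = mu_t + (Z - mu_s) @ A.T
        # Move only the low-dimensional component; keep the PCA residual
        return X + (Z_new - Z) @ P.T

    return transport
```

In this closed-form setting no iterative OT solver is needed; the map is a single affine transform in the PCA subspace, which is what makes the intervention cheap enough to apply to high-dimensional model activations.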

🔍 Key Points

  • Introduces an optimal transport framework for ablating refusal behavior in language models, generalizing beyond prior single-direction removal approaches.
  • Combines Gaussian optimal transport with PCA to enable efficient computation in high-dimensional space while preserving the geometric structure of activations.
  • Demonstrates that targeted layer interventions can significantly outperform full-network interventions, suggesting that refusal mechanisms are localized in specific layers.
  • Achieves higher attack success rates (up to 11% improvement over state-of-the-art methods) while maintaining language model utility, as measured by perplexity on benchmark datasets.
  • Provides insights into the vulnerabilities of current alignment methods, showcasing the potential for distributional attacks beyond simple directional removals.
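The layer-selective finding above (intervening on 1-2 layers at roughly 40-60% of network depth) could be operationalized with a small helper like the one below. This is a hypothetical sketch: `select_intervention_layers`, its defaults, and the evenly-spaced selection rule are illustrative assumptions; the paper selects layers empirically.

```python
def select_intervention_layers(n_layers, frac_range=(0.4, 0.6), max_layers=2):
    """Pick up to `max_layers` layer indices in the given depth window.

    Mirrors the reported finding that applying the OT transform to
    1-2 layers at ~40-60% depth outperforms full-network intervention.
    """
    lo = int(round(frac_range[0] * (n_layers - 1)))
    hi = int(round(frac_range[1] * (n_layers - 1)))
    candidates = list(range(lo, hi + 1))
    # Take up to `max_layers` evenly spaced layers from the window
    step = max(1, len(candidates) // max_layers)
    return candidates[::step][:max_layers]
```

For a 32-layer model this yields indices near the middle of the network, which would then be the only layers whose activations are passed through the fitted transport map at inference time.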

💡 Why This Paper Matters

This paper is significant because it offers a novel methodology for exposing vulnerabilities in safety-aligned language models. By leveraging optimal transport theory, it not only improves the effectiveness of refusal-ablation attacks but also provides critical insights into the geometric structure of model representations. This dual contribution highlights the need for more robust safety mechanisms in LLMs in the face of increasingly sophisticated bypass techniques.

🎯 Why It's Interesting for AI Security Researchers

This research is of high interest to AI security researchers because it examines, in depth, the vulnerabilities inherent in current language-model safety frameworks. The method introduces advanced techniques for circumventing established defense mechanisms, underscoring the need for continued improvement of AI alignment and safety strategies. The findings also offer a framework for understanding how to harden models against adversarial manipulation.

📚 Read the Full Paper