SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

Authors: Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio

Published: 2025-11-11

arXiv ID: 2511.08379v2

Added to Library: 2025-11-14 23:01 UTC

📄 Abstract

Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
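
As a rough illustration of the pipeline described in the abstract, the sketch below trains a toy Self-Organizing Map on harmful-prompt activations and derives refusal directions by subtracting the harmless centroid from each SOM neuron. This is not the authors' implementation: the grid size, training schedule, and the synthetic activations are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the authors' code): derive multiple
# refusal directions from a SOM trained on harmful-prompt representations.
import numpy as np

def train_som(data, grid=(3, 3), epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a toy 2-D SOM on row vectors in `data` of shape (n_samples, d)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.normal(scale=0.1, size=(grid[0], grid[1], d))
    coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                  indexing="ij"), axis=-1)          # (gx, gy, 2)
    total_steps, step = epochs * n, 0
    for _ in range(epochs):
        for x in data[rng.permutation(n)]:
            frac = step / total_steps                 # linear decay schedule
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-3
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best-matching unit
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)         # pull neighborhood toward sample
            step += 1
    return weights.reshape(-1, d)                     # one prototype ("neuron") per row

# harmful_acts / harmless_acts stand in for hidden states of harmful and
# harmless prompts taken from one layer of the model (toy data here).
rng = np.random.default_rng(1)
harmful_acts = rng.normal(size=(200, 64))
harmless_acts = rng.normal(loc=0.5, size=(200, 64))

neurons = train_som(harmful_acts)                     # prototypes of harmful representations
mu_harmless = harmless_acts.mean(axis=0)              # centroid of harmless representations

# Each refusal direction = SOM neuron minus the harmless centroid,
# generalizing the single difference-in-means direction of prior work.
refusal_dirs = neurons - mu_harmless
refusal_dirs /= np.linalg.norm(refusal_dirs, axis=1, keepdims=True)
```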

🔍 Key Points

  • Proposal of a novel method that uses Self-Organizing Maps (SOMs) to extract multiple refusal directions from the latent space of safety-aligned language models, in contrast to prior work that encodes refusal as a single direction.
  • Proof that SOMs generalize the difference-in-means technique used by prior single-direction approaches.
  • Derivation of a set of directions expressing the refusal concept by training SOMs on harmful prompt representations and subtracting the centroid of harmless representations from each resulting neuron.
  • Extensive experimental validation showing that ablating multiple directions from the model's internals suppresses refusal more effectively than both the single-direction baseline and specialized jailbreak algorithms (see the ablation sketch after this list).
  • Analysis of the mechanistic implications of the approach, in line with evidence that concepts in LLMs are encoded as low-dimensional manifolds rather than single directions.
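
To make the ablation step concrete, the sketch below removes a set of refusal directions from hidden activations by projecting onto the orthogonal complement of their span. This is one standard way to ablate multiple directions; the paper's exact procedure and hook points may differ, and the `refusal_dirs` and `h` arrays here are toy stand-ins.

```python
# Minimal sketch (assumed ablation scheme): project activations onto the
# orthogonal complement of the subspace spanned by the refusal directions.
import numpy as np

def ablate_directions(hidden, directions):
    """Remove from `hidden` (..., d_model) every component lying in the span
    of `directions` (k, d_model)."""
    q, _ = np.linalg.qr(directions.T)    # orthonormal basis of the spanned subspace
    coeffs = hidden @ q                  # coordinates of `hidden` in that basis
    return hidden - coeffs @ q.T         # subtract the spanned component

rng = np.random.default_rng(0)
refusal_dirs = rng.normal(size=(9, 512))              # stand-in for SOM-derived directions
refusal_dirs /= np.linalg.norm(refusal_dirs, axis=1, keepdims=True)

h = rng.normal(size=(4, 512))                         # toy residual-stream activations
h_ablated = ablate_directions(h, refusal_dirs)
assert np.allclose(h_ablated @ refusal_dirs.T, 0.0, atol=1e-8)
```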

💡 Why This Paper Matters

This paper is significant because it revisits the assumption that refusal in safety-aligned language models is mediated by a single latent direction. By using Self-Organizing Maps to extract multiple refusal directions and showing that ablating them suppresses refusal more effectively than the single-direction baseline and specialized jailbreak algorithms, it both exposes a broader attack surface for refusal suppression and deepens the mechanistic understanding of how safety behavior is represented in model internals.

🎯 Why It's Interesting for AI Security Researchers

The findings presented in this paper are especially relevant to AI security researchers: they show that refusal can be suppressed by ablating a set of SOM-derived directions, outperforming specialized jailbreak algorithms and indicating that defenses and analyses built around a single refusal direction may underestimate the attack surface. Understanding how the refusal concept is distributed across the latent space is therefore important for designing, red-teaming, and evaluating robust safety-alignment mechanisms.

📚 Read the Full Paper