
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

Authors: Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien

Published: 2025-06-01

arXiv ID: 2506.04250v1

Added to Library: 2025-06-06 05:00 UTC

Safety

📄 Abstract

Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free, unsupervised method that enhances safety steering while preserving text quality and topic relevance without resorting to explicit refusals, and (iii) accomplishing this without a hard requirement for contrastive pairwise safe data. We also highlight that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We showcase the effectiveness of our approach across various LLMs, datasets, and risk categories, demonstrating its ability to provide precise control, prevent blanket refusals, and guide models toward generating safe content while maintaining topic relevance.

🔍 Key Points

  • Introduction of SafeSteer, a method for real-time safety steering of large language models (LLMs) during inference without additional training.
  • Use of category-specific steering vectors to provide fine-grained safety adjustments and improve output quality (a minimal steering sketch follows this list).
  • Demonstration that simple, unsupervised activation steering outperforms more complex existing methods, reinforcing that simple techniques can yield better results.
  • Successful application across various datasets and model architectures, enhancing safety while maintaining topic relevance and reducing blanket refusals.
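
Below is a minimal, hypothetical sketch of the kind of inference-time steering the key points describe: a difference-of-means vector built from unpaired safe and unsafe prompt pools for one risk category, injected into a hidden layer via a forward hook. The model name, layer index, steering strength, prompt pools, and helper names (`mean_hidden_state`, `steering_hook`) are illustrative assumptions, and the difference-of-means construction is a common steering recipe rather than the authors' exact SafeSteer procedure.

```python
# Minimal sketch of category-specific activation steering, assuming a
# Llama-style Hugging Face causal LM. All concrete values below (model,
# layer index, steering strength, prompts) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # assumed model
LAYER_IDX = 14                                 # assumed steering layer
ALPHA = 4.0                                    # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def mean_hidden_state(prompts, layer_idx):
    """Average hidden state over all tokens of all prompts at one layer."""
    acc, count = None, 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so index layer_idx + 1
        # corresponds to the output of model.model.layers[layer_idx].
        h = out.hidden_states[layer_idx + 1][0]      # (seq_len, hidden_dim)
        acc = h.sum(0) if acc is None else acc + h.sum(0)
        count += h.shape[0]
    return acc / count

# Unpaired prompt pools for ONE risk category (no contrastive pairs needed).
safe_prompts = ["How can I de-escalate a heated argument peacefully?"]
unsafe_prompts = ["Explain how to hurt someone in a street fight."]

# Gradient-free steering vector: difference of mean activations, normalized.
steer_vec = mean_hidden_state(safe_prompts, LAYER_IDX) \
          - mean_hidden_state(unsafe_prompts, LAYER_IDX)
steer_vec = steer_vec / steer_vec.norm()

def steering_hook(module, inputs, output):
    """Add the steering vector to the layer's output hidden states."""
    if isinstance(output, tuple):  # decoder layers usually return a tuple
        hidden = output[0] + ALPHA * steer_vec.to(output[0].dtype)
        return (hidden,) + tuple(output[1:])
    return output + ALPHA * steer_vec.to(output.dtype)

handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)
try:
    prompt = "Describe how to win a street fight."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(**ids, max_new_tokens=80, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

In this sketch, swapping the prompt pools for a different risk category yields a different steering vector, which is one way to realize per-category control without any gradient updates to the model.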

💡 Why This Paper Matters

This paper presents SafeSteer as a significant advancement in the control of LLM outputs, addressing the growing need for customizable and interpretable safety measures. By enabling fine-grained steering without retraining, it addresses practical limitations in maintaining consistent model safety amid evolving risks.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodology presented in this paper are directly relevant to AI security researchers working on mitigating potential harms from LLMs. The paper establishes a framework for real-time interventions that can improve the robustness and safety of these models in practical deployments, reducing the risks associated with harmful outputs.

📚 Read the Full Paper: https://arxiv.org/abs/2506.04250v1