
Analysing the Safety Pitfalls of Steering Vectors

Authors: Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci

Published: 2026-03-25

arXiv ID: 2603.24543v1

Added to Library: 2026-03-26 03:00 UTC

Red Teaming

📄 Abstract

Activation steering has emerged as a powerful tool to shape LLM behavior without the need for weight updates. While its inherent brittleness and unreliability are well-documented, its safety implications remain underexplored. In this work, we present a systematic safety audit of steering vectors obtained with Contrastive Activation Addition (CAA), a widely used steering approach, under a unified evaluation protocol. Using JailbreakBench as a benchmark, we show that steering vectors consistently influence the success rate of jailbreak attacks, with stronger amplification under simple template-based attacks. Across LLM families and sizes, steering the model in specific directions can drastically increase (up to 57%) or decrease (up to 50%) its attack success rate (ASR), depending on the targeted behavior. We attribute this phenomenon to the overlap between the steering vectors and the latent directions of refusal behavior, which offers a traceable explanation for the effect. Together, our findings reveal the previously unobserved origin of this safety gap in LLMs, highlighting a trade-off between controllability and safety.
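For readers unfamiliar with CAA, the sketch below shows the usual construction: a steering vector is the difference of mean residual-stream activations over contrastive prompt pairs, added to the hidden states at inference scaled by a coefficient. This is a minimal illustration, not the paper's code; the tensor shapes, function names, and random stand-in activations are all assumptions made for the example.

```python
# Minimal CAA-style sketch (illustrative; not the paper's implementation).
# Assumes per-layer residual-stream activations have already been extracted
# for paired contrastive prompts; random tensors stand in for them here.
import torch

def caa_steering_vector(pos_acts: torch.Tensor,
                        neg_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations over contrastive prompt pairs.

    pos_acts, neg_acts: (n_pairs, d_model) activations at one layer,
    taken at the answer token of the positive / negative completion.
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def apply_steering(hidden: torch.Tensor,
                   v: torch.Tensor,
                   alpha: float) -> torch.Tensor:
    """Add the scaled steering vector to every position's residual stream.

    hidden: (batch, seq_len, d_model) hidden states at the chosen layer.
    """
    return hidden + alpha * v

# Toy usage with random stand-ins for extracted activations.
d_model, n_pairs = 4096, 128
pos = torch.randn(n_pairs, d_model)
neg = torch.randn(n_pairs, d_model)
v = caa_steering_vector(pos, neg)
steered = apply_steering(torch.randn(1, 16, d_model), v, alpha=1.5)
```

In practice the addition is performed via a forward hook on the chosen layer during generation; the sign and magnitude of `alpha` determine the steering direction and strength, which is exactly the knob the paper's audit varies.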

🔍 Key Points

  • Activation steering can significantly influence the success rate of jailbreak attacks on large language models (LLMs), indicating a trade-off between model controllability and safety.
  • The study provides a systematic safety audit of steering vectors constructed using Contrastive Activation Addition (CAA) across different LLM families and sizes.
  • A mechanistic explanation is offered for the observed safety gaps: steering vectors can overlap with the latent directions of refusal behavior, directly impacting the attack success rates (ASR).
  • Directional ablation of refusal-aligned components from steering vectors can mitigate safety vulnerabilities, pointing to potential strategies for improving LLM safety while retaining controllability (see the sketch after this list).
  • The findings prompt critical reflections on the geometric properties of steering techniques and their interactions with inherent safety mechanisms in LLMs.

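To make the overlap diagnosis and the ablation mitigation concrete, here is a minimal sketch: it measures the cosine overlap between a steering vector and a refusal direction, then projects the refusal-aligned component out. The refusal direction would come from a separate contrastive extraction (e.g., harmful vs. harmless prompts); all names and the toy data below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of refusal-overlap measurement and directional ablation.
import torch

def cosine_overlap(v: torch.Tensor, refusal_dir: torch.Tensor) -> float:
    """Cosine similarity between a steering vector and the refusal direction."""
    return torch.nn.functional.cosine_similarity(v, refusal_dir, dim=0).item()

def ablate_direction(v: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Project out the refusal-aligned component: v - (v . r_hat) r_hat."""
    r = refusal_dir / refusal_dir.norm()
    return v - (v @ r) * r

# Toy usage: after ablation, the overlap is (numerically) zero.
d = 4096
v = torch.randn(d)           # stand-in steering vector
refusal = torch.randn(d)     # stand-in refusal direction
v_ablated = ablate_direction(v, refusal)
print(cosine_overlap(v, refusal), cosine_overlap(v_ablated, refusal))
```

The projection leaves the rest of the steering vector intact, which is why, per the paper's findings, ablating the refusal-aligned component can reduce the ASR side effect while preserving much of the intended behavioral control.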
💡 Why This Paper Matters

This paper reveals significant safety vulnerabilities in the use of activation steering to control LLM behavior, demonstrating that the increased controllability it offers comes with risks to model safety and alignment. The systematic audit of these effects across multiple models is crucial for understanding the dynamics of LLM control and safety, and it lays the groundwork for safer steering methodologies.

🎯 Why It's Interesting for AI Security Researchers

This research highlights the unintended consequences of using activation steering in LLMs, in particular how it can be exploited to bypass safety mechanisms. Understanding these vulnerabilities is critical for developing robust defenses against adversarial attacks and for deploying LLMs safely and securely.

📚 Read the Full Paper