AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

Authors: Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

Published: 2025-06-08

arXiv ID: 2506.07022v1

Added to Library: 2025-06-10 04:02 UTC

Red Teaming

📄 Abstract

As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to the internal activations of LLMs during inference, thereby inducing refusal behavior. However, indiscriminately applying activation steering fundamentally suffers from a trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it treats activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, under null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising general capabilities. Our code is available at https://github.com/AlphaLab-USTC/AlphaSteer.
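To make the trade-off concrete, the snippet below is a minimal, purely illustrative sketch of vanilla activation steering at a single layer: a fixed refusal direction is added to every prompt's hidden activation, so malicious prompts are pushed toward refusal but benign prompts are shifted as well. The hidden size, the `alpha` scale, and the random stand-in activations are assumptions for illustration, not values from the paper.

```python
import numpy as np

# Illustrative stand-ins (not from the paper): one layer's hidden size and
# cached activations for a benign prompt and a jailbreak prompt.
rng = np.random.default_rng(0)
d = 64
h_benign = rng.normal(size=d)
h_malicious = rng.normal(size=d)
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

def vanilla_steer(h, r, alpha=8.0):
    """Indiscriminate activation steering: the same scaled refusal vector is
    added to every activation, which induces refusal on malicious prompts but
    also distorts benign ones (the safety-utility trade-off)."""
    return h + alpha * r

steered_benign = vanilla_steer(h_benign, refusal_dir)      # benign prompt is shifted too
steered_malicious = vanilla_steer(h_malicious, refusal_dir)
```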

🔍 Key Points

  • AlphaSteer introduces a null-space constraint to activation steering, enabling LLMs to refuse malicious prompts while preserving performance on benign prompts.
  • The proposed method separates the steering process into two learning objectives: utility preservation for benign inputs and safety enhancement for malicious inputs, resulting in improved overall model safety without sacrificing utility.
  • Extensive experiments show that AlphaSteer consistently outperforms existing activation steering baselines in terms of defense success rates against various jailbreak attacks while maintaining high utility scores across benchmark tasks.
  • AlphaSteer employs a learnable transformation matrix that dynamically constructs steering vectors from the input activations themselves, avoiding hand-crafted heuristics for deciding how to respond to different prompt types (see the sketch after this list).
  • The effectiveness of AlphaSteer is supported by theoretical grounding and empirical validation across multiple LLM architectures and prompt types.
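The list above describes AlphaSteer at the level of objectives; the sketch below is one way those objectives could be instantiated, assuming the learnable steering is a matrix applied to a null-space-projected activation. A projector built from cached benign activations makes the steering vector vanish on benign inputs, and the matrix is fit by ridge regression so that malicious activations map onto the refusal direction. All names (`nullspace_projector`, `fit_delta`, the synthetic data, the regularizer `lam`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nullspace_projector(H_benign):
    """Projector onto the (numerical) null space of the benign activation
    matrix H_benign (n x d): vectors multiplied by P lose any component lying
    in the subspace spanned by benign activations."""
    _, S, Vt = np.linalg.svd(H_benign, full_matrices=False)
    k = int(np.sum(S > 1e-6 * S.max()))   # effective rank of the benign subspace
    V_benign = Vt[:k]                     # principal benign directions (k x d)
    return np.eye(H_benign.shape[1]) - V_benign.T @ V_benign

def fit_delta(H_malicious, refusal_dir, P, lam=1.0):
    """Ridge regression so that (h @ P) @ Delta ~ refusal_dir for malicious
    activations; benign activations are annihilated by P, so their steering
    vector stays near zero by construction."""
    Z = H_malicious @ P                                    # null-space-projected activations
    R = np.tile(refusal_dir, (H_malicious.shape[0], 1))    # target refusal direction per row
    d = Z.shape[1]
    # Closed-form ridge solution: Delta = (Z^T Z + lam I)^-1 Z^T R
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ R)

def steer(h, P, Delta):
    """Inference-time steering vector computed from the activation itself."""
    return h + (h @ P) @ Delta

# Toy usage with synthetic low-rank "benign" activations and mean-shifted
# "malicious" activations (purely illustrative).
rng = np.random.default_rng(0)
d = 64
H_benign = rng.normal(size=(200, 20)) @ rng.normal(size=(20, d))   # low-rank benign subspace
H_malicious = rng.normal(size=(200, d)) + 2.0
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

P = nullspace_projector(H_benign)
Delta = fit_delta(H_malicious, refusal_dir, P)
print(np.linalg.norm((H_benign[0] @ P) @ Delta))      # ~0: benign activation left untouched
print(np.linalg.norm((H_malicious[0] @ P) @ Delta))   # noticeably larger: steered toward refusal
```

The design intuition, as this sketch reads it: the null-space constraint makes utility preservation a structural property of the projector rather than something the regression must learn, so the regression can focus entirely on pushing malicious activations toward the refusal direction.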

💡 Why This Paper Matters

Overall, this paper presents a significant advancement in the safety of large language models, introducing a principled approach to managing the trade-off between safety and utility. By enhancing the ability of LLMs to refuse malicious prompts through learned, null-space-constrained activation steering, AlphaSteer paves the way for safer deployment of AI technologies in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly valuable for AI security researchers because it addresses jailbreak vulnerabilities in LLMs. As models become more integrated into everyday applications, ensuring their robustness against malicious prompts is crucial. The techniques developed in AlphaSteer provide a well-defined, theoretically grounded framework for enhancing the safety of LLMs without degrading utility, making it a significant contribution to the field of AI safety and security.

📚 Read the Full Paper