SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models

Authors: Xiyu Zeng, Siyuan Liang, Liming Lu, Haotian Zhu, Enguang Liu, Jisheng Dang, Yongbin Zhou, Shuchao Pang

Published: 2025-09-24

arXiv ID: 2509.21400v1

Added to Library: 2025-09-29 04:02 UTC

📄 Abstract

As the capabilities of Vision Language Models (VLMs) continue to improve, they are increasingly targeted by jailbreak attacks. Existing defense methods face two major limitations: (1) they struggle to ensure safety without compromising the model's utility; and (2) many defense mechanisms significantly reduce the model's inference efficiency. To address these challenges, we propose SafeSteer, a lightweight, inference-time steering framework that effectively defends against diverse jailbreak attacks without modifying model weights. At the core of SafeSteer is the innovative use of Singular Value Decomposition to construct a low-dimensional "safety subspace." By projecting and reconstructing the raw steering vector into this subspace during inference, SafeSteer adaptively removes harmful generation signals while preserving the model's ability to handle benign inputs. The entire process is executed in a single inference pass, introducing negligible overhead. Extensive experiments show that SafeSteer reduces the attack success rate by over 60% and improves accuracy on normal tasks by 1-2%, without introducing significant inference latency. These results demonstrate that robust and practical jailbreak defense can be achieved through simple, efficient inference-time control.
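The abstract describes the core mechanism only at a high level: use Singular Value Decomposition to construct a low-dimensional "safety subspace," then project and reconstruct the raw steering vector inside that subspace. Below is a minimal sketch of how such a construction could look, assuming the subspace is estimated from paired harmful/benign hidden states; the function and variable names (`build_safety_subspace`, `harmful_acts`, `benign_acts`, `rank`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the subspace-steering idea from the abstract (not the paper's code).
import torch

def build_safety_subspace(harmful_acts: torch.Tensor,
                          benign_acts: torch.Tensor,
                          rank: int = 8) -> torch.Tensor:
    """Estimate a low-rank 'safety subspace' from activation differences.

    harmful_acts, benign_acts: (num_samples, hidden_dim) hidden states collected
    from paired harmful / benign prompts (an assumed data-collection scheme).
    Returns a (hidden_dim, rank) orthonormal basis.
    """
    # Difference vectors point roughly along "harmful generation" directions.
    diffs = harmful_acts - benign_acts                       # (N, d)
    diffs = diffs - diffs.mean(dim=0, keepdim=True)          # center before SVD
    # Right singular vectors span the dominant directions of variation.
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)  # vh: (min(N, d), d)
    return vh[:rank].T                                        # basis: (d, rank)

def project_steering_vector(raw_steer: torch.Tensor,
                            basis: torch.Tensor) -> torch.Tensor:
    """Project a raw steering vector onto the subspace and reconstruct it,
    discarding components that lie outside the safety subspace."""
    return basis @ (basis.T @ raw_steer)
```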

🔍 Key Points

  • Proposes SafeSteer, a lightweight, inference-time steering framework that defends vision-language models against diverse jailbreak attacks without modifying model weights.
  • Uses Singular Value Decomposition to construct a low-dimensional "safety subspace" for steering.
  • Projects and reconstructs the raw steering vector within this subspace during inference, adaptively removing harmful generation signals while preserving the model's handling of benign inputs (see the sketch after this list).
  • Executes the entire defense in a single inference pass, introducing negligible computational overhead.
  • Reports that SafeSteer reduces the attack success rate by over 60% and improves accuracy on normal tasks by 1-2%, without adding significant inference latency.
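Since the defense runs in a single forward pass without touching model weights, one plausible realization is to apply the reconstructed steering vector to a chosen layer's hidden states via a forward hook. The sketch below assumes a decoder-style backbone exposing `model.model.layers` (as in LLaMA-like models); `layer_idx`, `alpha`, and the sign of the intervention are illustrative choices, not details confirmed by the abstract.

```python
# Hedged sketch of inference-time application via a PyTorch forward hook.
import torch

def attach_steering_hook(model, layer_idx: int,
                         steer_vec: torch.Tensor, alpha: float = 1.0):
    """Register a hook that shifts hidden states away from the harmful direction.

    No weights are modified; removing the hook restores the original model.
    """
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        vec = steer_vec.to(device=hidden.device, dtype=hidden.dtype)
        # alpha controls steering strength; sign/scale are assumptions.
        hidden = hidden - alpha * vec
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # Attribute path is an assumption; adjust for the actual VLM backbone.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

A usage pattern would be to build the basis offline, project the raw steering vector once, attach the hook before generation, and call `handle.remove()` afterwards to restore unsteered behavior.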

💡 Why This Paper Matters

SafeSteer shows that robust jailbreak defense for vision-language models can be achieved through simple, efficient inference-time control rather than retraining or heavyweight filtering. By steering hidden representations through a low-dimensional safety subspace, it addresses the long-standing tension between safety and utility while keeping overhead negligible, making it a practical option for deployed multimodal systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant for AI security researchers because it demonstrates that representation-level interventions can defend vision-language models against diverse jailbreak attacks without weight updates or added latency. Its subspace-steering formulation provides a concrete baseline for studying the safety-utility trade-off and for developing stronger, practical inference-time defenses in multimodal systems.

📚 Read the Full Paper