Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Authors: Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang

Published: 2025-08-30

arXiv ID: 2509.00373v1

Added to Library: 2025-09-04 04:02 UTC

📄 Abstract

Vision Language Models (VLMs) have demonstrated impressive capabilities in integrating visual and textual information for understanding and reasoning, but remain highly vulnerable to adversarial attacks. While activation steering has emerged as a promising defense, existing approaches often rely on task-specific contrastive prompts to extract harmful directions, which exhibit suboptimal performance and can degrade visual grounding. To address these limitations, we propose Sequence-Level Preference Optimization for VLM (SPO-VLM), a novel two-stage defense framework that combines activation-level intervention with policy-level optimization to enhance model robustness. In Stage I, we compute adaptive layer-specific steering vectors from diverse data sources, enabling generalized suppression of harmful behaviors during inference. In Stage II, we refine these steering vectors through a sequence-level preference optimization process. This stage integrates automated toxicity assessment, as well as visual-consistency rewards based on caption-image alignment, to achieve safe and semantically grounded text generation. The two-stage structure of SPO-VLM balances efficiency and effectiveness by combining a lightweight mitigation foundation in Stage I with deeper policy refinement in Stage II. Extensive experiments show that SPO-VLM enhances safety against attacks via activation steering and preference optimization, while maintaining strong performance on benign tasks without compromising visual understanding capabilities. We will release our code, model weights, and evaluation toolkit to support reproducibility and future research. Warning: This paper may contain examples of offensive or harmful text and images.
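
To make the Stage I idea concrete, here is a minimal sketch of activation-level steering, assuming a Hugging Face-style decoder whose layers expose hidden states and accept PyTorch forward hooks. The mean-difference construction, the helper names (mean_hidden_state, build_steering_vector, add_steering_hook), the single layer_idx, and the scaling coefficient alpha are illustrative assumptions, not the paper's implementation; SPO-VLM computes adaptive, layer-specific vectors from diverse data sources rather than a single contrast of two prompt sets.

```python
# Hedged sketch of activation steering in the spirit of Stage I.
# Assumes a Hugging Face-style causal model; all names and the fixed
# scaling coefficient are illustrative, not the paper's method.
import torch


@torch.no_grad()
def mean_hidden_state(model, tokenizer, prompts, layer_idx):
    """Average last-token hidden state at one layer over a set of prompts."""
    states = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(states).mean(dim=0)


def build_steering_vector(model, tokenizer, harmful_prompts, benign_prompts, layer_idx):
    """Harmful-minus-benign activation difference, normalized to unit length."""
    direction = (mean_hidden_state(model, tokenizer, harmful_prompts, layer_idx)
                 - mean_hidden_state(model, tokenizer, benign_prompts, layer_idx))
    return direction / direction.norm()


def add_steering_hook(layer_module, vector, alpha=4.0):
    """Subtract the harmful direction from the layer output at inference time."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden - alpha * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)
```

The handle returned by register_forward_hook can later be removed with handle.remove() to restore the unsteered model, which keeps this kind of intervention lightweight and reversible at inference time.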

🔍 Key Points

  • Proposes SPO-VLM, a two-stage defense for Vision Language Models that pairs activation-level intervention with policy-level (sequence-level preference) optimization to counter jailbreak attacks.
  • Stage I derives adaptive, layer-specific steering vectors from diverse data sources, enabling generalized suppression of harmful behaviors at inference without task-specific contrastive prompts.
  • Stage II refines these steering vectors via sequence-level preference optimization, combining automated toxicity assessment with visual-consistency rewards based on caption-image alignment (a reward-combination sketch follows this list).
  • The two-stage design balances efficiency and effectiveness: a lightweight mitigation foundation in Stage I and deeper policy refinement in Stage II.
  • Extensive experiments indicate improved safety under attack while preserving performance on benign tasks and visual understanding; code, model weights, and an evaluation toolkit are to be released.

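As a rough illustration of the Stage II reward described above, the sketch below combines a toxicity score and a caption-image alignment score into a single sequence-level signal that could rank candidate generations into preference pairs. The SequenceReward class, the linear weighting with safety_weight, and the generic toxicity_fn / alignment_fn scorers are assumptions for illustration only, not the paper's implementation.

```python
# Hedged sketch of a sequence-level reward balancing safety and visual
# grounding. The weighting scheme and scorer interfaces are assumptions;
# the paper's toxicity assessment and caption-image-alignment reward may
# be implemented differently.
from dataclasses import dataclass


@dataclass
class SequenceReward:
    toxicity_fn: callable        # text -> toxicity probability in [0, 1]
    alignment_fn: callable       # (image, text) -> similarity in [0, 1]
    safety_weight: float = 0.5   # hypothetical safety/grounding trade-off

    def __call__(self, image, response: str) -> float:
        safety = 1.0 - self.toxicity_fn(response)        # higher = safer
        grounding = self.alignment_fn(image, response)   # higher = more faithful
        return self.safety_weight * safety + (1.0 - self.safety_weight) * grounding


def rank_for_preference_pairs(reward, image, candidates):
    """Order candidate generations so (best, worst) can form a preference pair."""
    return sorted(candidates, key=lambda text: reward(image, text), reverse=True)
```

In practice the two scorers might be an off-the-shelf toxicity classifier and a CLIP-style image-text similarity model, but any calibrated scores in [0, 1] fit this interface.
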
💡 Why This Paper Matters

This paper matters because it tackles a persistent weakness of Vision Language Models: jailbreak and adversarial attacks that slip past alignment when visual and textual inputs are combined. By grounding activation steering in sequence-level preference optimization rather than hand-crafted contrastive prompts, SPO-VLM aims to improve safety while preserving visual grounding and benign-task performance, and the promised release of code, model weights, and an evaluation toolkit supports reproducibility and follow-up work.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, the work shows how lightweight activation-level interventions can be learned and then refined with sequence-level preference optimization, offering a defense against multimodal jailbreaks that does not depend on task-specific contrastive prompts. The combination of automated toxicity assessment with visual-consistency rewards also illustrates how safety and task utility can be balanced when hardening VLMs, a trade-off central to deploying these models responsibly.

📚 Read the Full Paper