
Safety-Potential Pruning for Enhancing Safety Prompts Against VLM Jailbreaking Without Retraining

Authors: Chongxin Li, Hanzhang Wang, Lian Duan

Published: 2026-03-15

arXiv ID: 2603.14219v1

Added to Library: 2026-03-17 03:01 UTC

Red Teaming

📄 Abstract

Safety prompts constitute an interpretable layer of defense against jailbreak attacks in vision-language models (VLMs); however, their efficacy is constrained by the models' latent structural responsiveness. We observe that such prompts consistently engage a sparse set of parameters that remain largely quiescent during benign use. This finding motivates the Safety Subnetwork Hypothesis: VLMs embed structurally distinct pathways capable of enforcing safety, but these pathways remain dormant without explicit stimulation. To expose and amplify these pathways, we introduce Safety-Potential Pruning, a one-shot pruning framework that amplifies safety-relevant activations by removing the weights least responsive to safety prompts, without any additional retraining. Across three representative VLM architectures and three jailbreak benchmarks, our method reduces attack success rates by up to 22% relative to prompting alone while maintaining strong benign performance. These findings frame pruning not only as a model compression technique but also as a structural intervention that surfaces alignment-relevant subnetworks, offering a new path to robust jailbreak resistance.
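The abstract does not spell out the scoring rule, but the mechanism it describes (ranking weights by how strongly they respond when the safety prompt is present, then removing the least responsive fraction in a single shot) can be sketched roughly as follows. This is a minimal PyTorch illustration assuming a Wanda-style magnitude-times-activation importance score; the function and argument names (`collect_input_norms`, `safety_potential_prune`, `safe_inputs`, `plain_inputs`, `sparsity`) are hypothetical and do not reflect the authors' implementation.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def collect_input_norms(model: nn.Module, inputs: dict) -> dict:
    """Record the per-channel L2 norm of activations entering each Linear layer."""
    norms, handles = {}, []

    def make_hook(name):
        def hook(_module, args, _output):
            x = args[0].detach().float().reshape(-1, args[0].shape[-1])
            norms[name] = x.norm(dim=0)  # one value per input feature
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))
    model(**inputs)  # single calibration forward pass
    for h in handles:
        h.remove()
    return norms


@torch.no_grad()
def safety_potential_prune(model: nn.Module, safe_inputs: dict,
                           plain_inputs: dict, sparsity: float = 0.1) -> None:
    """One-shot: zero the weights whose activations respond least to the safety prompt."""
    safe_norms = collect_input_norms(model, safe_inputs)    # calibration inputs WITH safety prompt
    plain_norms = collect_input_norms(model, plain_inputs)  # same inputs WITHOUT it

    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear) or name not in safe_norms:
            continue
        # "Responsiveness": extra activation mass a weight sees when the safety
        # prompt is present, scaled by the weight's magnitude (Wanda-style score).
        delta = (safe_norms[name] - plain_norms[name]).clamp(min=0)
        score = module.weight.abs() * delta.unsqueeze(0)   # (out_features, in_features)
        k = int(sparsity * score.numel())
        if k == 0:
            continue
        idx = torch.topk(score.flatten(), k, largest=False).indices
        module.weight.data.view(-1)[idx] = 0.0             # remove least safety-responsive weights
```

In this reading, weights whose incoming activations barely change when the safety prompt is prepended are treated as safety-irrelevant and zeroed, which is one plausible way to amplify the remaining safety-responsive pathway without any gradient updates.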

🔍 Key Points

  • Introduction of the Safety-Potential Pruning method to enhance safety prompts in vision-language models (VLMs) without the need for retraining.
  • Establishment of the Safety Subnetwork Hypothesis, asserting that effective safety behaviors in VLMs are localized to specific sparse subnetworks.
  • Demonstration of significant improvements in the Defense Success Rate (DSR) against jailbreak attacks, with attack success rates reduced by up to 22% across models and benchmarks.
  • Experimental validation showcasing the robustness of the proposed method across different VLM architectures and tasks while maintaining utility in benign scenarios.
  • Highlighting the potential of pruning as a structural intervention that activates latent safety mechanisms, rather than solely as a model compression technique.

💡 Why This Paper Matters

This paper presents a significant advance in hardening vision-language models against adversarial threats. The proposed Safety-Potential Pruning method offers an efficient way to amplify latent safety capabilities in VLMs without the computational cost of traditional retraining. By focusing on the latent safety-responsive subnetworks, the authors offer a deeper understanding of how safety behaviors are operationalized within VLMs, making a compelling case for structure-aware interventions in AI safety.

🎯 Why It's Interesting for AI Security Researchers

The findings in this paper are directly relevant to AI security researchers, offering new insights into defending against jailbreak attacks, a growing concern as large AI models are deployed. The pruning-based defense not only contributes to the literature on model robustness but also provides practical strategies for better aligning AI systems with safety objectives. As AI systems become increasingly integrated into sensitive applications, understanding and mitigating vulnerabilities to adversarial manipulation is crucial.

📚 Read the Full Paper