
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap

Authors: Wenhan Yang, Spencer Stice, Ali Payani, Baharan Mirzasoleiman

Published: 2025-05-30

arXiv ID: 2505.24208v1

Added to Library: 2025-06-02 03:02 UTC

Safety

📄 Abstract

Ensuring that vision-language models (VLMs) generate safe outputs is crucial for their reliable deployment. However, large VLMs (LVLMs) suffer from drastic safety degradation relative to their LLM backbones: even blank or irrelevant images can trigger LVLMs to produce harmful responses to prompts that would be refused in text-only contexts. The modality gap between image and text representations has recently been hypothesized to contribute to this safety degradation, but whether and how the size of the gap affects LVLM safety has not been studied. In this work, we show that the size of the modality gap is strongly inversely correlated with LVLM safety. We then show that this gap is introduced during LVLM pretraining and persists through fine-tuning. Motivated by this observation, we propose a regularization that reduces the modality gap during pretraining. Extensive experiments on LLaVA v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves the safety alignment of LVLMs, reducing the unsafe response rate by up to 16.3% without compromising performance, and further boosting existing defenses by up to 18.2%.
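
To make the quantity under discussion concrete, here is a minimal sketch of one common way to measure a modality gap: the distance between the centroids of L2-normalized image and text embeddings. The function name and choice of metric are illustrative assumptions; the paper's exact measurement is not reproduced here.

```python
# Hypothetical sketch: quantify a modality gap as the distance between the
# centroids of L2-normalized image and text embeddings (one common definition;
# the paper's exact metric may differ).
import torch
import torch.nn.functional as F

def modality_gap(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> float:
    """image_embeds, text_embeds: [N, D] features in the LLM's input embedding space."""
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    # Euclidean distance between the two modality centroids.
    return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()
```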

🔍 Key Points

  • The paper introduces ReGap, a regularization method that reduces the modality gap between image and text embeddings during the pretraining phase of large vision-language models (LVLMs), leading to improved safety alignment (a hedged sketch of such a regularizer follows this list).
  • Extensive experiments demonstrate that reducing the modality gap is correlated with a decrease in the unsafe output rates of LVLMs, achieving up to a 16.3% reduction in unsafe outputs without compromising performance across various benchmarks.
  • The study highlights that the modality gap contributes to safety degradation and that this gap originates during the pretraining phase, persisting through fine-tuning; thus, addressing it early is crucial for building safer LVLMs.
  • The proposed method also strengthens existing safety defenses, improving their effectiveness by up to 18.2% when combined with them, and remains compatible across different architectures.
  • The research is backed by extensive evaluations on popular LVLM architectures like LLaVA, ShareGPT4V, and MiniGPT-4, indicating the generalizability of ReGap across various models and datasets.
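
As referenced in the first key point, the following is a minimal, hypothetical sketch of what a gap-reducing pretraining regularizer could look like: a penalty on the distance between per-sample centroids of projected image tokens and caption token embeddings, added to the usual language-modeling loss. All names (gap_regularizer, lambda_gap, etc.) and the specific penalty are illustrative assumptions, not the paper's exact ReGap formulation.

```python
# Hypothetical sketch (not the paper's exact ReGap loss): penalize the distance
# between per-sample centroids of projected image tokens and caption token
# embeddings, and add it to the standard next-token prediction loss.
import torch

def gap_regularizer(image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """image_tokens: [B, T_img, D] projector outputs; text_tokens: [B, T_txt, D] caption embeddings."""
    img_centroid = image_tokens.mean(dim=1)   # [B, D] per-sample image centroid
    txt_centroid = text_tokens.mean(dim=1)    # [B, D] per-sample text centroid
    return (img_centroid - txt_centroid).pow(2).sum(dim=-1).mean()

def pretraining_loss(lm_loss: torch.Tensor,
                     image_tokens: torch.Tensor,
                     text_tokens: torch.Tensor,
                     lambda_gap: float = 0.1) -> torch.Tensor:
    # Total objective: language-modeling loss plus the weighted gap penalty.
    return lm_loss + lambda_gap * gap_regularizer(image_tokens, text_tokens)
```

In such a setup, the weight lambda_gap would be tuned so the penalty shrinks the gap without hurting captioning quality; the paper reports that its regularization improves safety without compromising benchmark performance.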

💡 Why This Paper Matters

This paper is significant as it addresses a critical issue in the safety of LVLMs by introducing a simple yet effective method to enhance their robustness against harmful prompts. By targeting the modality gap during pretraining, it not only contributes to the theoretical understanding of safety alignment in multimodal models but also provides practical solutions that can be readily integrated into current methodologies. The findings reinforce the necessity of ensuring safe AI outputs in real-world applications, making the paper relevant for both researchers and practitioners in the field.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting as it tackles the vulnerabilities associated with LVLMs, a growing area of concern in AI safety and ethical AI deployment. The insights into the modality gap and its implications for safety degradation present critical knowledge for developing more secure models that can resist adversarial attacks, such as jailbreak prompts. Moreover, by providing a method to enhance safety without extensive additional costs or data, the paper resonates with the overarching goals of improving AI robustness in real-world applications.

📚 Read the Full Paper