
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang

Published: 2025-06-05

arXiv ID: 2506.04704v1

Added to Library: 2025-06-06 05:00 UTC

📄 Abstract

Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
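
The abstract describes an architectural mechanism rather than a purely data-centric fix: a learnable safety meta token appended to the visual tokens and a dedicated safety head that classifies harmfulness. The PyTorch sketch below illustrates that idea only; the module name, hidden dimension, number of harmfulness categories, and where the meta token sits in the sequence are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a "safety meta token + safety head" wrapper, assuming a
# LLaVA-style pipeline where a vision encoder produces patch tokens that are
# later fed to the language model. All names and sizes are illustrative.
import torch
import torch.nn as nn


class SafetyMetaTokenWrapper(nn.Module):
    def __init__(self, hidden_dim: int = 1024, num_harm_categories: int = 8):
        super().__init__()
        # Learnable meta token appended to the visual token sequence; intended
        # to absorb harmful visual cues during safety tuning.
        self.safety_meta_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        nn.init.normal_(self.safety_meta_token, std=0.02)
        # Lightweight safety head: predicts a harmfulness category from the
        # meta token's representation, giving an interpretable signal that can
        # accompany a refusal.
        self.safety_head = nn.Linear(hidden_dim, num_harm_categories)

    def forward(self, visual_tokens: torch.Tensor):
        # visual_tokens: (batch, num_patches, hidden_dim) from a vision encoder.
        batch = visual_tokens.size(0)
        meta = self.safety_meta_token.expand(batch, -1, -1)
        # The augmented sequence would be passed on to the language model so
        # the meta token can condition generation toward safer responses.
        augmented_tokens = torch.cat([visual_tokens, meta], dim=1)
        harm_logits = self.safety_head(augmented_tokens[:, -1, :])
        return augmented_tokens, harm_logits


# Usage sketch with dummy features standing in for real encoder output.
if __name__ == "__main__":
    feats = torch.randn(2, 576, 1024)  # e.g. 24x24 patches, CLIP-like width
    wrapper = SafetyMetaTokenWrapper()
    tokens, logits = wrapper(feats)
    print(tokens.shape, logits.shape)  # (2, 577, 1024), (2, 8)
```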

🔍 Key Points

  • Introduces HoliSafe, a holistic safety-tuning dataset and benchmark covering all five safe/unsafe image-text combinations, including contextually unsafe outcomes that arise from seemingly benign image-text pairs (one plausible enumeration of these combinations is sketched after this list).
  • Proposes SafeLLaVA, a VLM augmented with a learnable safety meta token that encodes harmful visual cues during training and intrinsically steers the language model toward safer responses.
  • Adds a dedicated safety head that produces interpretable harmfulness classifications aligned with the model's refusal rationales, moving beyond purely data-centric safety tuning.
  • Reports state-of-the-art safety performance for SafeLLaVA trained on HoliSafe across multiple VLM safety benchmarks.
  • Shows that the HoliSafe benchmark exposes critical vulnerabilities in existing VLMs, motivating further work on robust and interpretable multimodal safety alignment.
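
The abstract speaks of "all five safe/unsafe image-text combinations" without listing them on this page. The enum below is one plausible reading, in which the safe-image/safe-text case splits into a fully safe pair and a benign-looking pair that becomes contextually unsafe in combination; treat the category names as assumptions, with the exact taxonomy defined in the paper.

```python
# Hypothetical enumeration of the five image-text safety combinations; the
# labels are illustrative assumptions, not the paper's official taxonomy.
from enum import Enum


class ImageTextSafetyType(Enum):
    UNSAFE_IMAGE_UNSAFE_TEXT = "unsafe image + unsafe text"
    UNSAFE_IMAGE_SAFE_TEXT = "unsafe image + safe text"
    SAFE_IMAGE_UNSAFE_TEXT = "safe image + unsafe text"
    SAFE_IMAGE_SAFE_TEXT_UNSAFE_COMBO = "benign-looking pair, contextually unsafe together"
    SAFE_IMAGE_SAFE_TEXT = "fully safe image + text"
```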

💡 Why This Paper Matters

This paper matters because it tackles two persistent gaps in VLM safety: safety-tuning datasets and benchmarks that only partially capture how image-text interactions can yield harmful content, and prior methods that rely on data-centric tuning with little architectural support. By pairing a holistic dataset and benchmark (HoliSafe) with an architectural mechanism (SafeLLaVA's safety meta token and safety head), it provides both a more robust basis for training and evaluation and an intrinsically safer, more interpretable model design.

🎯 Why It's Interesting for AI Security Researchers

This paper is of significant interest to AI security researchers because VLMs remain vulnerable to jailbreak attacks, especially in image-text configurations that existing safety datasets do not cover. HoliSafe offers a benchmark that systematically spans the five safe/unsafe image-text combinations and is shown to reveal critical vulnerabilities in current models, while SafeLLaVA's safety meta token and safety head provide an interpretable, architecture-level defense whose harmfulness classifications align with refusal rationales. Together they give researchers concrete tools for red-teaming multimodal models and for building defenses that go beyond data-centric tuning.

📚 Read the Full Paper