
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Authors: Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng

Published: 2025-06-24

arXiv ID: 2506.19257v1

Added to Library: 2025-06-25 04:00 UTC

Red Teaming

📄 Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
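
Since the dataset is released on the Hugging Face Hub, it should be loadable with the standard `datasets` library. The snippet below is a minimal sketch under that assumption; the split and column names are not documented in this summary, so it simply inspects whatever the repository exposes.

```python
# Minimal sketch: load MSR-Align from the Hugging Face Hub and inspect its structure.
# Split and column names are not specified in this summary, so we just peek at them.
from datasets import load_dataset

ds = load_dataset("Leigest/MSR-Align")  # repo id from the paper's release link
print(ds)                               # lists the available splits and columns
first_split = next(iter(ds))            # take whichever split comes first
print(ds[first_split][0])               # show one raw record
```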

🔍 Key Points

  • Introduction of MSR-Align, a novel dataset designed for safety alignment in vision-language models (VLMs), addressing unique safety vulnerabilities related to multimodal reasoning.
  • Implementation of a three-phase data generation pipeline covering multimodal diversity, policy-grounded reasoning, and automatic quality filtering to produce high-quality, safety-compliant training data (a hedged sketch of the filtering step follows this list).
  • Demonstrated substantial improvements in robustness against jailbreak attacks and overall safety performance after fine-tuning VLMs with MSR-Align, while maintaining or improving general reasoning capabilities.
  • Emphasis on structured chain-of-thought (CoT) reasoning that explicitly binds visual and policy context, enabling fine-grained risk detection in model outputs.
  • Extensive evaluation across established benchmarks shows MSR-Align outperforms existing safety datasets, validating its efficacy for multimodal safety alignment.
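
To make the quality-filtering phase concrete, here is a hedged sketch of a judge-gated pass over candidate training samples. The record fields and the `judge_score` interface are illustrative assumptions, not MSR-Align's actual schema or judge; in the paper this gate is implemented with strong multimodal judges.

```python
# Illustrative sketch of judge-based quality filtering (the pipeline's final phase).
# All field names and the judge interface are hypothetical, not MSR-Align's actual schema.
from dataclasses import dataclass

@dataclass
class SafetyReasoningSample:
    image_path: str   # visual half of the multimodal prompt
    question: str     # potentially unsafe textual query
    policy: str       # standardized safety policy the reasoning must cite
    reasoning: str    # policy-grounded chain-of-thought trace
    response: str     # final safety-compliant answer

def judge_score(sample: SafetyReasoningSample) -> float:
    """Stand-in for a strong multimodal judge that rates policy adherence and
    reasoning quality on a 0-1 scale. Replace with a real judge call; the
    constant below only keeps the sketch runnable."""
    return 1.0

def quality_filter(samples: list[SafetyReasoningSample],
                   threshold: float = 0.8) -> list[SafetyReasoningSample]:
    """Keep only candidates whose judge score clears the quality bar."""
    return [s for s in samples if judge_score(s) >= threshold]
```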

💡 Why This Paper Matters

This paper is critical to advancing the safety and alignment of multimodal AI systems. By identifying and addressing vulnerabilities unique to vision-language models, MSR-Align offers a concrete, dataset-level intervention for improving model behavior in real-world applications. A specialized, policy-grounded dataset strengthens the reliability and safety of AI systems that handle both visual and textual inputs, making it valuable for developers and researchers aiming for safer AI deployments.

🎯 Why It's Interesting for AI Security Researchers

This research is of particular interest to AI security researchers because it tackles the emerging threats associated with multimodal AI systems. By providing a framework for improving safety alignment through rigorous reasoning and policy grounding, it offers insights into minimizing the risk of harmful behaviors in VLMs. The findings could also help shape future safety protocols and datasets that mitigate these vulnerabilities, ultimately contributing to more secure AI applications.

📚 Read the Full Paper: https://arxiv.org/abs/2506.19257v1