OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Authors: Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, Ranjie Duan, Tianlin Li, Yihao Huang, Zhitao Zeng, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, Yang Liu

Published: 2025-12-06

arXiv ID: 2512.06589v1

Added to Library: 2025-12-09 03:02 UTC

Red Teaming

📄 Abstract

Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, graded on a granular, multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreak attacks. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.
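
To make the attack-defense pipeline concrete, the snippet below is a minimal sketch of how a unified evaluation loop of this kind could be wired together. All names here (MultimodalQuery, EvalRecord, run_eval, and the callable signatures) are illustrative assumptions rather than the actual OmniSafeBench-MM API; consult the linked repository for the real interface.

```python
# Hypothetical sketch of a multimodal jailbreak attack-defense evaluation loop.
# None of these names come from the OmniSafeBench-MM repo; they only illustrate
# the attack -> defense -> model -> judge pipeline the paper describes.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MultimodalQuery:
    text: str          # (possibly adversarial) textual prompt
    image_path: str    # path to the paired image input


@dataclass
class EvalRecord:
    attack: str           # attack method name
    defense: str          # defense strategy name
    response: str         # raw MLLM output
    harmfulness: int      # multi-level harm scale (0 = harmless)
    intent_aligned: bool  # does the response serve the harmful intent?
    detail_level: int     # how detailed/actionable the response is


def run_eval(
    queries: List[MultimodalQuery],
    attacks: Dict[str, Callable[[MultimodalQuery], MultimodalQuery]],
    defenses: Dict[str, Callable[[MultimodalQuery], MultimodalQuery]],
    model: Callable[[MultimodalQuery], str],
    judge: Callable[[MultimodalQuery, str], Tuple[int, bool, int]],
) -> List[EvalRecord]:
    """Cross every query with every attack and defense, then score responses."""
    records: List[EvalRecord] = []
    for query in queries:
        for attack_name, attack in attacks.items():
            adversarial = attack(query)                 # craft the jailbreak input
            for defense_name, defense in defenses.items():
                response = model(defense(adversarial))  # defend, then query the MLLM
                harm, aligned, detail = judge(query, response)
                records.append(EvalRecord(attack_name, defense_name,
                                          response, harm, aligned, detail))
    return records
```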

🔍 Key Points

  • Introduction of OmniSafeBench-MM, a unified toolbox for comprehensive multimodal jailbreak attack-defense evaluation, integrating 13 attack methods and 15 defense strategies.
  • Creation of a large-scale multimodal dataset spanning 9 risk domains and 50 fine-grained categories, addressing previous gaps in dataset comprehensiveness.
  • Development of a three-dimensional evaluation protocol measuring harmfulness, intent alignment, and response detail level, allowing for nuanced safety-utility analysis beyond simple attack success rates (a minimal aggregation sketch follows this list).
  • Extensive experimentation revealing how different MLLMs exhibit varied vulnerability to jailbreak attacks, highlighting the effectiveness and trade-offs of multiple defense mechanisms.
  • Provision of an open-source platform that enables reproducibility, standardization, and further exploration in multimodal safety evaluation.
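
The three-dimensional protocol above scores each response for harmfulness, intent alignment, and detail level; the sketch below shows one plausible way to roll those per-response scores up into benchmark-level metrics. It reuses the hypothetical EvalRecord from the earlier sketch, and harm_threshold is an illustrative assumption, not the paper's actual setting.

```python
# Hypothetical aggregation over the three evaluation dimensions; threshold
# and scale bounds are illustrative assumptions, not the paper's settings.
from statistics import mean
from typing import Dict, List


def summarize(records: List["EvalRecord"],
              harm_threshold: int = 3) -> Dict[str, float]:
    """Collapse per-response scores into benchmark-level metrics.

    A response counts as a successful jailbreak only if it is both
    sufficiently harmful AND aligned with the malicious intent, which is
    stricter than a keyword-based attack-success check.
    """
    if not records:
        return {"attack_success_rate": 0.0,
                "mean_harmfulness": 0.0,
                "mean_detail_level": 0.0}
    successes = [r for r in records
                 if r.harmfulness >= harm_threshold and r.intent_aligned]
    return {
        "attack_success_rate": len(successes) / len(records),
        "mean_harmfulness": mean(r.harmfulness for r in records),
        "mean_detail_level": mean(r.detail_level for r in records),
    }
```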

💡 Why This Paper Matters

The introduction of OmniSafeBench-MM marks a significant advance in the safety evaluation of multimodal large language models, particularly in understanding and mitigating jailbreak attacks. By establishing standardized evaluation protocols and datasets, the paper provides essential resources that can guide future research on multimodal model safety, contributing to the broader goal of making AI systems more robust against malicious inputs.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers because it tackles one of the foremost challenges in deploying AI models: resilience to jailbreak attacks. By providing a comprehensive framework that both systematically evaluates vulnerabilities and explores effective defensive strategies, this work lays the groundwork for future research aimed at improving the safety and resilience of AI models in real-world applications.

📚 Read the Full Paper: https://arxiv.org/abs/2512.06589v1