SafeMT: Multi-turn Safety for Multimodal Language Models

Authors: Han Zhu, Juntao Dai, Jiaming Ji, Haoran Li, Chengkun Cai, Pengcheng Wen, Chi-Min Chan, Boyuan Chen, Yaodong Yang, Sirui Han, Yike Guo

Published: 2025-10-14

arXiv ID: 2510.12133v1

Added to Library: 2025-10-15 04:00 UTC

Red Teaming

📄 Abstract

With the widespread use of multi-modal large language models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety of these models in multi-turn dialogues, we introduce SafeMT, a benchmark of dialogues of varying lengths generated from harmful queries accompanied by images. The benchmark comprises 10,000 samples in total, covering 17 scenarios and four jailbreak methods. Additionally, we propose the Safety Index (SI) to evaluate the overall safety of MLLMs during conversations. We assess the safety of 17 models on this benchmark and find that the risk of a successful attack increases as the number of turns in a harmful dialogue grows, indicating that these models' safety mechanisms are inadequate for recognizing hazards that emerge across a dialogue. We further propose a dialogue safety moderator that detects malicious intent concealed within conversations and provides MLLMs with relevant safety policies. Experimental results on several open-source models show that this moderator reduces the multi-turn attack success rate (ASR) more effectively than existing guard models.
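The abstract's central finding is that attack success rate rises with the number of harmful turns. As a minimal illustrative sketch (not the paper's evaluation code; the record fields are assumptions), one way to surface that trend is to group per-dialogue evaluation results by turn count and compute ASR for each group:

```python
from collections import defaultdict

def asr_by_dialogue_length(records):
    """Compute attack success rate (ASR) grouped by dialogue length.

    Each record is assumed to be a dict with:
      - "turns":   int, number of turns in the harmful dialogue
      - "success": bool, whether the jailbreak elicited a harmful response
    (Field names are illustrative, not taken from the SafeMT release.)
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for rec in records:
        totals[rec["turns"]] += 1
        successes[rec["turns"]] += int(rec["success"])
    return {t: successes[t] / totals[t] for t in sorted(totals)}

# Toy example: ASR growing as the number of harmful turns increases.
records = [
    {"turns": 1, "success": False},
    {"turns": 1, "success": True},
    {"turns": 3, "success": True},
    {"turns": 3, "success": True},
]
print(asr_by_dialogue_length(records))  # {1: 0.5, 3: 1.0}
```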

🔍 Key Points

  • Introduction of the SafeMT benchmark, designed specifically to evaluate the multi-turn safety of multimodal large language models (MLLMs).
  • Development of the Safety Index (SI) for a more nuanced evaluation of MLLM safety during dialogues, addressing shortcomings of traditional metrics such as attack success rate (ASR).
  • Proposal of a dialogue safety moderator that detects malicious intent concealed across conversation turns and supplies models with relevant safety policies (see the sketch after this list).
  • Findings that risks of successful attacks on MLLMs increase significantly with the number of turns in dialogues, highlighting safety vulnerabilities in multi-turn interactions.
  • Experimental results demonstrating the effectiveness of the dialogue safety moderator, outperforming existing guard models in reducing attack success rates.
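The moderator described above screens the whole conversation rather than the latest prompt in isolation. A minimal sketch of that idea follows, assuming placeholder callables `mllm`, `intent_classifier`, and `policy_lookup`; none of these names come from the paper's release, and the actual moderator may be implemented quite differently:

```python
def moderated_reply(mllm, intent_classifier, policy_lookup, history, user_turn, image=None):
    """Illustrative wrapper around an MLLM call in the spirit of a dialogue
    safety moderator: classify intent over the full conversation so that
    malicious intent split across turns is visible, and if the dialogue is
    flagged, surface the relevant safety policy to the model before it
    responds. All callables here are hypothetical placeholders.
    """
    convo = history + [{"role": "user", "content": user_turn, "image": image}]

    # Classify intent over the entire dialogue, not just the newest turn.
    verdict = intent_classifier(convo)  # e.g. {"harmful": True, "category": "weapons"}

    system_prefix = ""
    if verdict.get("harmful"):
        # Inject the scenario-specific safety policy into the model's context.
        system_prefix = policy_lookup(verdict.get("category", "general"))

    return mllm(system=system_prefix, messages=convo)
```

The design choice worth noting is that single-turn guard models judge each prompt independently, whereas this wrapper conditions the safety check on the accumulated conversation, which is where the paper reports the larger gains against multi-turn attacks.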

💡 Why This Paper Matters

This paper addresses the often neglected problem of multi-turn safety in multimodal language models, giving researchers and practitioners a dedicated benchmark, an evaluation metric, and a mitigation method. By pairing SafeMT with the Safety Index and a dialogue safety moderator, it provides concrete tools for measuring and reducing safety risks in conversational AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it targets safety failures that only surface across multiple dialogue turns, a setting common in real-world MLLM deployments. SafeMT and the Safety Index provide an evaluation framework for assessing and improving model resilience against multi-turn jailbreaks, and the proposed moderator offers a practical mitigation, contributing to safer AI deployment.

📚 Read the Full Paper