
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism

Authors: Beitao Chen, Xinyu Lyu, Lianli Gao, Jingkuan Song, Heng Tao Shen

Published: 2025-07-02

arXiv ID: 2507.01513v1

Added to Library: 2025-07-03 04:01 UTC

Red Teaming Safety

📄 Abstract

By incorporating visual inputs, Multimodal Large Language Models (MLLMs) extend LLMs to support visual reasoning. However, this integration also introduces new vulnerabilities, making MLLMs susceptible to multimodal jailbreak attacks and hindering their safe deployment. Existing defense methods, including Image-to-Text Translation, Safe Prompting, and Multimodal Safety Tuning, attempt to address this by aligning multimodal inputs with LLMs' built-in safeguards. Yet, they fall short of uncovering the root causes of multimodal vulnerabilities, particularly how harmful multimodal tokens trigger jailbreaks in MLLMs. Consequently, they remain vulnerable to text-driven multimodal jailbreaks, often exhibiting overdefensive behaviors and imposing heavy training overhead. To bridge this gap, we present a comprehensive analysis of where, how, and which harmful multimodal tokens bypass safeguards in MLLMs. Surprisingly, we find that fewer than 1% of tokens in the early-to-middle layers are responsible for inducing unsafe behaviors, highlighting that precisely removing a small subset of harmful tokens, without requiring safety tuning, can still effectively improve safety against jailbreaks. Motivated by this, we propose Safe Prune-then-Restore (SafePTR), a training-free defense framework that selectively prunes harmful tokens at vulnerable layers while restoring benign features at subsequent layers. Without incurring additional computational overhead, SafePTR significantly enhances the safety of MLLMs while preserving efficiency. Extensive evaluations across three MLLMs and five benchmarks demonstrate SafePTR's state-of-the-art performance in mitigating jailbreak risks without compromising utility.
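
The sketch below is a rough, hypothetical illustration of the prune-then-restore idea described in the abstract, not the authors' implementation. It assumes per-token harmfulness scores are already available, flags roughly the top 1% of tokens, zeroes them out at an assumed vulnerable layer, and re-injects their cached features at a later layer so benign content is not permanently lost. The layer indices, the scoring, and the zeroing-out pruning strategy are placeholders chosen for illustration; the paper's layer-wise analysis would determine these in practice.

```python
import torch

def prune_then_restore(hidden_states, scores, layer_fns,
                       prune_layer, restore_layer, top_frac=0.01):
    """Illustrative prune-then-restore pass over a stack of transformer-style layers.

    hidden_states: (seq_len, d_model) token representations entering layer_fns[0]
    scores:        (seq_len,) hypothetical per-token harmfulness scores
    layer_fns:     list of callables, one per layer
    """
    seq_len = hidden_states.size(0)
    k = max(1, int(top_frac * seq_len))          # roughly <1% of tokens, per the paper's finding
    pruned_idx = torch.topk(scores, k).indices   # tokens flagged as harmful (assumed scoring)

    cached = None                                # features saved before pruning
    h = hidden_states
    for li, layer in enumerate(layer_fns):
        if li == prune_layer:
            cached = h[pruned_idx].clone()       # remember the original features
            h = h.clone()
            h[pruned_idx] = 0.0                  # prune: zero out flagged tokens (one possible choice)
        if li == restore_layer and cached is not None:
            h = h.clone()
            h[pruned_idx] = cached               # restore the cached features at a later layer
        h = layer(h)
    return h

if __name__ == "__main__":
    torch.manual_seed(0)
    layers = [torch.nn.Linear(64, 64) for _ in range(8)]  # stand-in blocks
    h0 = torch.randn(200, 64)                             # 200 tokens, d_model = 64
    s = torch.rand(200)                                   # hypothetical harmfulness scores
    out = prune_then_restore(h0, s, layers, prune_layer=2, restore_layer=5)
    print(out.shape)                                      # torch.Size([200, 64])
```

Because the intervention is a pair of index operations inside an ordinary forward pass, it adds no training and negligible inference cost, which is consistent with the training-free, overhead-free framing above.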

🔍 Key Points

  • The paper introduces SafePTR, a training-free framework that enhances the safety of Multimodal Large Language Models (MLLMs) against jailbreak attacks without additional computational cost.
  • It identifies that fewer than 1% of tokens in the early-to-middle layers are responsible for inducing unsafe behaviors, enabling targeted pruning of this small harmful subset.
  • SafePTR utilizes a Prune-then-Restore mechanism that first removes harmful tokens and then restores benign features to maintain model utility.
  • Extensive experiments demonstrate SafePTR's superior performance over existing defense methods, achieving state-of-the-art results in mitigating multimodal jailbreak risks across multiple benchmarks.
  • The authors elucidate the propagation of harmful token vulnerabilities through an in-depth analysis of layer-wise behavior and semantic drift, providing insights that can guide future defense strategies.

💡 Why This Paper Matters

This paper is significant as it addresses the emerging safety challenges posed by jailbreak vulnerabilities in multimodal models, which are increasingly integrated into real-world applications. By proposing a novel defense mechanism that does not require extensive retraining, it provides a practical and efficient approach to enhancing model robustness while preserving performance. The findings underscore the importance of understanding the underlying mechanisms of adversarial attacks, laying the groundwork for further advancements in AI safety.

🎯 Why It's Interesting for AI Security Researchers

Given the growing reliance on multimodal AI systems in critical applications such as security, healthcare, and customer service, understanding vulnerabilities and methods to defend against them is crucial. This paper is of particular interest to AI security researchers as it offers a new framework that improves the robustness of MLLMs against attacks, providing insights that can be applied to enhance security protocols and safety in AI systems.

📚 Read the Full Paper