
DefenSee: Dissecting Threat from Sight and Text - A Multi-View Defensive Pipeline for Multi-modal Jailbreaks

Authors: Zihao Wang, Kar Wai Fok, Vrizlynn L. L. Thing

Published: 2025-12-01

arXiv ID: 2512.01185v1

Added to Library: 2025-12-02 04:01 UTC

📄 Abstract

Multi-modal large language models (MLLMs), capable of processing text, images, and audio, have been widely adopted in various AI applications. However, recent MLLMs integrating images and text remain highly vulnerable to coordinated jailbreaks. Existing defenses primarily focus on text and lack robust multi-modal protection. As a result, studies indicate that MLLMs are more susceptible to malicious or unsafe instructions than their text-only counterparts. In this paper, we propose DefenSee, a robust and lightweight multi-modal black-box defense technique that leverages image-variant transcription and cross-modal consistency checks, mimicking human judgment. Experiments on popular multi-modal jailbreak and benign datasets show that DefenSee consistently enhances MLLM robustness while better preserving performance on benign tasks compared to SOTA defenses. It reduces the attack success rate (ASR) of jailbreak attacks to below 1.70% on MiniGPT4 on the MM-SafetyBench benchmark, significantly outperforming prior methods under the same conditions.
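The abstract describes the defense only at a high level, so the sketch below is an illustration rather than the authors' implementation: the variant set, the `transcribe_image` and `text_similarity` callables, and the `consistency_threshold` are hypothetical placeholders standing in for whatever the paper actually uses.

```python
"""Minimal sketch of a DefenSee-style check (assumptions noted above)."""
from typing import Callable, Iterable, List


def defensee_filter(
    image_variants: Iterable[object],               # e.g. original, blurred, cropped renditions
    user_prompt: str,
    transcribe_image: Callable[[object], str],      # black-box MLLM call that describes an image
    text_similarity: Callable[[str, str], float],   # consistency score in [0, 1]
    consistency_threshold: float = 0.5,             # hypothetical cut-off
) -> bool:
    """Return True if the request looks benign, False if it should be refused.

    Mirrors the idea in the abstract: transcribe several variants of the input
    image, then check whether the transcriptions agree with each other and
    with the textual prompt. Low consistency suggests the image may be
    carrying hidden or adversarial instructions.
    """
    transcripts: List[str] = [transcribe_image(v) for v in image_variants]

    # Cross-variant consistency: do the transcriptions agree with each other?
    pairwise = [
        text_similarity(a, b)
        for i, a in enumerate(transcripts)
        for b in transcripts[i + 1:]
    ]
    # Cross-modal consistency: does the prompt agree with what the image shows?
    prompt_scores = [text_similarity(user_prompt, t) for t in transcripts]

    scores = pairwise + prompt_scores
    if not scores:
        return True  # nothing to compare; defer to the base model's own safety
    return min(scores) >= consistency_threshold
```

Keeping the MLLM behind a caller-supplied `transcribe_image` callable reflects the black-box setting claimed in the abstract: the defense only needs query access to the model, not its weights.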

🔍 Key Points

  • DefenSee is a lightweight, black-box defense pipeline for multi-modal LLMs, requiring no access to model internals or retraining.
  • The defense transcribes multiple variants of the input image and applies cross-modal consistency checks between the transcriptions and the textual prompt, mimicking human judgment.
  • Existing defenses focus primarily on text, which helps explain why MLLMs are more susceptible to malicious or unsafe instructions than their text-only counterparts.
  • On the MM-SafetyBench benchmark with MiniGPT4, DefenSee reduces the attack success rate of jailbreak attacks to below 1.70%, significantly outperforming prior methods under the same conditions.
  • Experiments on popular multi-modal jailbreak and benign datasets show that DefenSee preserves performance on benign tasks better than SOTA defenses.

💡 Why This Paper Matters

This paper targets a gap in MLLM safety: most existing defenses operate on text alone, while coordinated image-text jailbreaks exploit the visual channel. DefenSee shows that a lightweight, black-box pipeline built on image-variant transcription and cross-modal consistency checks can substantially reduce jailbreak success rates while better preserving performance on benign tasks, making it a practical defensive layer for deployed multi-modal systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers because multi-modal jailbreaks are an increasingly practical attack surface as MLLMs are deployed in sensitive applications. The proposed defense requires only black-box query access, is evaluated on popular multi-modal jailbreak and benign datasets such as MM-SafetyBench, and reports both attack success rates and benign-task utility, giving researchers a concrete point of comparison for future multi-modal defenses and underscoring that text-only safeguards are insufficient for image-text models.
