โ† Back to Library

VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

Authors: Aofan Liu, Lulu Tang

Published: 2025-10-09

arXiv ID: 2510.09699v1

Added to Library: 2025-10-14 04:01 UTC

Red Teaming

๐Ÿ“„ Abstract

Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.

๐Ÿ” Key Points

  • Introduction of VisualDAN as an adversarial method that embeds DAN commands into images to compromise the integrity of Vision-Language Models (VLMs).
  • Extensive experimental validation on various VLMs, including MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA, demonstrating high attack success rates and the ability to generate harmful content.
  • Identification of specific vulnerabilities in VLMs due to their multimodal natureโ€”particularly susceptibility to adversarial images, which can bypass traditional text-based safeguards.
  • Analyses of the impact of toxic content on the effectiveness of attacks, showing that even minimal toxic prompts can lead to significant harmful outputs once defenses are compromised.
  • Highlighting the urgent need for enhanced security measures in VLMs, along with recommendations for future research directions in this domain.

๐Ÿ’ก Why This Paper Matters

The paper introduces a significant new approach to attacking Vision-Language Models via vulnerabilities unique to their multimodal structure, illustrating critical challenges in AI security. It not only uncovers specific weaknesses in existing models, but also emphasizes the profound risks posed by adversarial inputs in practical applications, making it imperative for the community to address these security gaps.

๐ŸŽฏ Why It's Interesting for AI Security Researchers

Given the growing reliance on VLMs for applications ranging from content moderation to automated responses, this paper is crucial for AI security researchers. It provides insights into the innovative strategies utilized in adversarial attacks, highlights vulnerabilities inherent in VLMs, and sets the stage for developing robust defenses, making it a valuable resource for improving the safety and reliability of AI systems.

๐Ÿ“š Read the Full Paper