Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations

Authors: Divyanshu Kumar, Shreyas Jena, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi

Published: 2025-10-23

arXiv ID: 2510.20223v1

Added to Library: 2025-10-24 04:00 UTC

Red Teaming

📄 Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress, yet remain critically vulnerable to adversarial attacks that exploit weaknesses in cross-modal processing. We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models, showing that even simple perceptual transformations can reliably bypass state-of-the-art safety filters. Our evaluation spans 1,900 adversarial prompts across three high-risk safety categories: harmful content, CBRN (Chemical, Biological, Radiological, Nuclear), and CSEM (Child Sexual Exploitation Material), tested against seven frontier models. We explore the effectiveness of attack techniques on MLLMs, including FigStep-Pro (visual keyword decomposition), Intelligent Masking (semantic obfuscation), and audio perturbations (Wave-Echo, Wave-Pitch, Wave-Speed). The results reveal severe vulnerabilities: models with almost perfect text-only safety (0% ASR) suffer >75% attack success under perceptually modified inputs, with FigStep-Pro achieving up to 89% ASR in Llama-4 variants. Audio-based attacks further uncover provider-specific weaknesses, with even basic modality transfer yielding 25% ASR for technical queries. These findings expose a critical gap between text-centric alignment and multimodal threats, demonstrating that current safeguards fail to generalize across cross-modal attacks. The accessibility of these attacks, which require minimal technical expertise, suggests that robust multimodal AI safety will require a paradigm shift toward broader semantic-level reasoning to mitigate possible risks.
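
The paper does not ship reference code, but the audio perturbations named in the abstract correspond to standard signal-processing operations. The sketch below shows what Wave-Pitch, Wave-Speed, and Wave-Echo style transforms could look like, assuming librosa and soundfile for audio I/O and resynthesis; the function names, the `prompt.wav` input path, and all parameter values are illustrative assumptions rather than the authors' settings.

```python
# Illustrative perceptual audio transforms in the spirit of Wave-Pitch,
# Wave-Speed, and Wave-Echo. Parameter values are arbitrary examples.
import numpy as np
import librosa
import soundfile as sf

def wave_pitch(y: np.ndarray, sr: int, n_steps: float = 2.0) -> np.ndarray:
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def wave_speed(y: np.ndarray, rate: float = 1.25) -> np.ndarray:
    """Speed up (rate > 1) or slow down (rate < 1) playback via time-stretching."""
    return librosa.effects.time_stretch(y, rate=rate)

def wave_echo(y: np.ndarray, sr: int, delay_s: float = 0.25, decay: float = 0.4) -> np.ndarray:
    """Mix in a delayed, attenuated copy of the signal to create an echo."""
    delay = int(delay_s * sr)
    out = np.zeros(len(y) + delay, dtype=y.dtype)
    out[: len(y)] += y
    out[delay:] += decay * y
    return out / np.max(np.abs(out))  # normalize to avoid clipping

if __name__ == "__main__":
    y, sr = librosa.load("prompt.wav", sr=None)  # hypothetical spoken-prompt file
    sf.write("prompt_pitch.wav", wave_pitch(y, sr), sr)
    sf.write("prompt_speed.wav", wave_speed(y), sr)
    sf.write("prompt_echo.wav", wave_echo(y, sr), sr)
```

Each transform keeps the spoken content intelligible to a human listener, matching the paper's notion of perceptually simple modifications.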

🔍 Key Points

  • First systematic study of multimodal jailbreak attacks targeting vision-language and audio-language models, characterizing vulnerabilities in their cross-modal processing.
  • Development of simple perceptual transformations that effectively circumvent state-of-the-art safety filters, demonstrating alarming attack success rates exceeding 75% for some models.
  • Introduction and experimentation with diverse attack techniques, including FigStep-Pro (visual keyword decomposition), Intelligent Masking (semantic obfuscation), and audio perturbations such as Wave-Echo, Wave-Pitch, and Wave-Speed, assessing their effectiveness and scalability.
  • Comprehensive evaluation framework testing seven frontier models against 1,900 adversarial prompts spanning three high-risk safety categories, providing substantial empirical evidence of vulnerabilities in current models (see the ASR sketch after this list).
  • Highlighting the critical misalignment between text-based safety mechanisms and the realities of multimodal threats, calling for a paradigm shift in AI safety strategies.
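
Attack success rate (ASR), cited throughout the abstract and key points, is conventionally the fraction of adversarial prompts for which the target model returns a non-refused, policy-violating response. A minimal tally per attack technique might look like the sketch below; `query_model` and `is_refusal` are hypothetical stand-ins for a model client and a refusal judge and are not components described in the paper.

```python
# Hypothetical harness for tallying attack success rate (ASR) per attack
# technique. `query_model` and `is_refusal` are stand-ins for a real model
# client and a refusal/safety judge; neither comes from the paper.
from collections import defaultdict
from typing import Callable, Dict, Iterable

def attack_success_rate(
    prompts: Iterable[dict],              # each item: {"attack": str, "payload": ...}
    query_model: Callable[[dict], str],   # sends the (transformed) prompt, returns the reply
    is_refusal: Callable[[str], bool],    # judge: did the model refuse to comply?
) -> Dict[str, float]:
    """Return ASR (share of prompts answered rather than refused) per attack technique."""
    successes: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for prompt in prompts:
        reply = query_model(prompt)
        totals[prompt["attack"]] += 1
        if not is_refusal(reply):         # a non-refusal counts as a successful jailbreak
            successes[prompt["attack"]] += 1
    return {attack: successes[attack] / totals[attack] for attack in totals}
```

Keeping the tally keyed by attack label makes it straightforward to compare, for example, FigStep-Pro or an audio perturbation against a text-only baseline on the same prompt set.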

💡 Why This Paper Matters

This paper is significant because it exposes overlooked vulnerabilities in multimodal AI systems, demonstrating that existing safety measures are ineffective against simple perceptual manipulations. The findings challenge the assumption that text-centric alignment generalizes to other modalities: until these gaps are addressed, responsible deployment of multimodal models cannot be assured. Given the rapid integration of multimodal models into critical applications, understanding these vulnerabilities is paramount.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers because it both outlines a clear, systematic methodology for evaluating multimodal model security and demonstrates the practical impact of vulnerabilities in widely used AI systems. The simplicity of the attacks makes them accessible to adversaries with minimal technical expertise, raising the urgency of deploying robust defenses. The empirical evidence and analysis can also inform future safety research, guiding the development of more resilient safety frameworks that address cross-modal threats.

📚 Read the Full Paper