
On Optimizing Multimodal Jailbreaks for Spoken Language Models

Authors: Aravind Krishnan, Karolina Stańczak, Dietrich Klakow

Published: 2026-03-19

arXiv ID: 2603.19127v1

Added to Library: 2026-03-20 03:01 UTC

Red Teaming

📄 Abstract

As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone along with an expanded attack surface. SLMs have previously been shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses. Yet existing attacks remain largely unimodal, optimizing either text or audio in isolation. We explore gradient-based multimodal jailbreaks by introducing JAMA (Joint Audio-text Multimodal Attack), a joint multimodal optimization framework combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio to perturb both modalities simultaneously. Evaluations across four state-of-the-art SLMs and four audio types demonstrate that JAMA surpasses unimodal jailbreak rates by 1.5x to 10x. We analyze the operational dynamics of this joint attack and show that a sequential approximation method makes it 4x to 6x faster. Our findings suggest that unimodal safety is insufficient for robust SLMs. The code and data are available at https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm

πŸ” Key Points

  • Introduction of JAMA (Joint Audio-text Multimodal Attack) which optimizes text and audio inputs simultaneously, outpacing traditional unimodal attack methods.
  • Demonstrated that multimodal attacks can increase jailbreak rates by 1.5x to 10x compared to unimodal approaches across various spoken language models (SLMs).
  • Proposed a sequential approximation method (SAMA) that significantly reduces computational costs while maintaining comparable effectiveness to JAMA, showcasing its potential for more efficient attacks.
  • Analysis of the joint optimization dynamics revealed that the GCG (Greedy Coordinate Gradient) approach can rapidly drive optimization in early stages, while PGD (Projected Gradient Descent) provides more refined adjustments over time.
  • The paper highlights that unimodal safety evaluations are inadequate for ensuring the robustness of SLMs against multimodal vulnerabilities.

💡 Why This Paper Matters

This paper is highly relevant as it identifies and addresses critical vulnerabilities in Spoken Language Models (SLMs), emphasizing the need for more effective security measures against multimodal attacks. The introduction of JAMA represents a significant advancement in understanding how combined textual and audio perturbations can pose greater risks to model safety than previously acknowledged methods. Moreover, the findings advocate for the re-evaluation of current safety assessment practices in AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting as it provides new insights into the vulnerabilities of multimodal AI systems. The novel methodologies introduced, such as the JAMA framework and the sequential optimization approach, offer tools that can be leveraged for both attacking and defending AI systems. The paper's empirical results demonstrating the effectiveness of multimodal attacks underscore the pressing need for robust safety measures, making it a crucial contribution to the field.

📚 Read the Full Paper