
When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Authors: Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin

Published: 2025-08-05

arXiv ID: 2508.03365v1

Added to Library: 2025-08-14 23:04 UTC

Red Teaming

📄 Abstract

As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio-language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under rigorous safety evaluation frameworks, including StrongREJECT, LlamaGuard, and human evaluation, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.
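
To make the two-stage pipeline concrete, the sketch below illustrates the Stage 2 Payload Injection step: a standard PGD loop that optimizes an L∞-bounded perturbation on a benign audio carrier so the model assigns high likelihood to the Stage 1 payload. The model interface (a forward pass that accepts a waveform and target token ids and returns a teacher-forcing loss), the function name, and all hyperparameter values are illustrative assumptions, not code from the paper.

```python
# Hedged sketch of Stage 2 "Payload Injection" via PGD.
# Assumptions (not from the paper): an HF-style audio-language model whose
# forward pass accepts a waveform and target token ids and returns a
# teacher-forcing loss; eps / step_size / num_steps are placeholder values.
import torch


def pgd_payload_injection(model, carrier_wave, target_ids,
                          eps=2e-3, step_size=5e-4, num_steps=500):
    """Find a small perturbation delta (||delta||_inf <= eps) of a benign
    audio carrier that pushes the model toward emitting `target_ids`, the
    native harmful response discovered in Stage 1."""
    delta = torch.zeros_like(carrier_wave, requires_grad=True)

    for _ in range(num_steps):
        adv_wave = (carrier_wave + delta).clamp(-1.0, 1.0)  # stay in valid audio range
        loss = model(audio=adv_wave, labels=target_ids).loss  # assumed interface
        loss.backward()

        with torch.no_grad():
            # Descend on the target loss (we *want* the payload to be likely),
            # then project back into the L-infinity ball of radius eps.
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()

    return (carrier_wave + delta).clamp(-1.0, 1.0).detach()
```

The sign-of-gradient step and the clamp back into the ε-ball are the standard PGD ingredients; the paper's actual perturbation budget, step schedule, and loss are not specified in this summary, so the values above should be read as placeholders.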

🔍 Key Points

  • Introduction of WhisperInject, a two-stage framework for adversarial audio attacks on audio-language models that reveals critical vulnerabilities.
  • Development of Reinforcement Learning with Projected Gradient Descent (RL-PGD) to discover native harmful payloads from the model itself, increasing attack success rates (a hedged sketch of this stage follows the list).
  • Over 86% attack success in manipulating state-of-the-art audio-language models while keeping the adversarial perturbations imperceptible to human listeners.
  • Shifts the focus of audio adversarial attacks from merely bypassing safety protocols to directly eliciting harmful model behaviors.
  • Demonstration of the feasibility and practicality of stealthy attacks in real-world scenarios, warning of potential widespread risks.
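
For the RL-PGD point above, one plausible reading of "reward-based optimization" is a REINFORCE-style loop wrapped in the usual PGD projection: sample a response from the perturbed audio, score it with a harmfulness judge, and nudge the perturbation toward responses the judge rates higher. Everything below is an illustrative assumption rather than the paper's exact formulation: the `judge` object, the `generate` and `score_likelihood` interfaces, and the single-sample update without a baseline.

```python
# Hedged sketch of a reward-guided search in the spirit of Stage 1 (RL-PGD).
# The judge, the model's generate/score_likelihood interfaces, and the
# single-sample REINFORCE-style update are illustrative assumptions.
import torch


def reward_guided_payload_search(model, judge, base_wave,
                                 eps=2e-3, step_size=5e-4, num_steps=300):
    """Search for a bounded audio perturbation whose decoded response scores
    highest under a harmfulness judge (e.g., a StrongREJECT-style scorer).
    The best decoded response becomes the target payload for Stage 2."""
    delta = torch.zeros_like(base_wave, requires_grad=True)
    best_response, best_reward = None, float("-inf")

    for _ in range(num_steps):
        adv_wave = (base_wave + delta).clamp(-1.0, 1.0)

        # Decode a candidate response and score it (no gradients needed here).
        response = model.generate(audio=adv_wave.detach())   # assumed interface
        reward = float(judge.score(response))                 # scalar, e.g. in [0, 1]
        if reward > best_reward:
            best_reward, best_response = reward, response

        # Reward-weighted ascent on the sampled response's log-likelihood,
        # followed by projection onto the eps-ball (the "PGD" part).
        log_prob = model.score_likelihood(audio=adv_wave, text=response)  # assumed
        (reward * log_prob).backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()

    return best_response
```

The highest-scoring decoded response returned here is what the Stage 2 PGD sketch above then embeds into a benign carrier such as a weather query.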

💡 Why This Paper Matters

This paper provides critical insights into the vulnerabilities of audio-language models, detailing a practical framework that lets adversaries covertly induce harmful behavior. It underscores the need for safety measures that go beyond traditional text filtering and points to valuable directions for future research in AI safety and security.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are highly relevant to AI security researchers as they expose a significant gap in the protection of audio-language models against adversarial attacks. With the increasing integration of AI in everyday audio interactions, understanding and mitigating such vulnerabilities are crucial to ensuring the safety and reliability of AI systems in real-world applications.

📚 Read the Full Paper