
Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio

Authors: Guangke Chen, Yuhui Wang, Shouling Ji, Xiapu Luo, Ting Wang

Published: 2025-11-14

arXiv ID: 2511.10913v1

Added to Library: 2025-11-17 03:00 UTC

Red Teaming

📄 Abstract

Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations, while proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.
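
To make the first attack family concrete, below is a minimal, benign sketch of the word-level shuffling idea: a phrase-level text filter sees only a scrambled sentence, yet the original ordering is trivially recoverable from side information. This is an illustrative reconstruction using placeholder text, not the paper's implementation; the actual Concat and Shuffle constructions, and how the TTS model is prompted to undo them, may differ.

```python
# Illustrative sketch (assumed, not from the paper): shuffle the words of a
# sentence so a phrase-matching text filter sees scrambled input, while the
# permutation makes the original order recoverable on the receiving side.
import random

def shuffle_obfuscate(sentence: str, seed: int = 0) -> tuple[list[str], list[int]]:
    """Permute the words of `sentence`; return the shuffled words and the permutation."""
    words = sentence.split()
    order = list(range(len(words)))
    random.Random(seed).shuffle(order)  # deterministic permutation for the demo
    return [words[i] for i in order], order

def deobfuscate(shuffled: list[str], order: list[int]) -> str:
    """Invert the permutation to recover the original word order."""
    restored = [""] * len(shuffled)
    for pos, original_index in enumerate(order):
        restored[original_index] = shuffled[pos]
    return " ".join(restored)

if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"  # benign stand-in text
    shuffled, order = shuffle_obfuscate(text)
    print("what a text filter sees:", " ".join(shuffled))
    print("recovered sentence:     ", deobfuscate(shuffled, order))
```

Concat presumably works on the same intuition, splitting content into fragments that are individually innocuous; the paper defines the exact constructions.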

🔍 Key Points

  • Introduces the concept of harmful content generation via text-to-speech systems, highlighting a shift from speaker impersonation to content misuse.
  • Presents HARMGEN, a suite of five novel attacks that exploit both text and audio modalities to bypass safety mechanisms in large audio-language models (LALMs).
  • Demonstrates that the attacks substantially reduce refusal rates (up to 100% acceptance in some settings) for generating harmful speech while increasing toxicity scores across five commercial TTS systems and three datasets (see the evaluation sketch after this list).
  • Identifies critical gaps in both reactive and proactive defenses: deepfake detectors underperform on high-fidelity audio, reactive moderation can be circumvented by adversarial perturbations, and proactive moderation detects only 57-93% of attacks.
  • Establishes the need for robust safeguards against content-centric TTS misuse, advocating for enhanced safety measures during model training and deployment.

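The refusal-rate and toxicity measurements mentioned above can be pictured with a small evaluation harness. The sketch below is an assumption about the shape of such a protocol, not the paper's code: it takes already-transcribed model outputs, flags refusals by phrase matching, and scores the rest with a pluggable `toxicity_fn`. The refusal markers and the toy scorer are placeholders; the authors' actual pipeline (ASR front end, refusal criteria, toxicity classifier) is not reproduced here.

```python
# Hedged sketch of a refusal-rate / toxicity evaluation loop. REFUSAL_MARKERS
# and toxicity_fn are illustrative assumptions; a real pipeline would
# transcribe the generated audio and run a trained toxicity classifier.
from typing import Callable

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "unable to")

def evaluate(transcripts: list[str],
             toxicity_fn: Callable[[str], float]) -> tuple[float, float]:
    """Return (refusal_rate, mean_toxicity) over a batch of output transcripts."""
    refused, scores = 0, []
    for t in transcripts:
        if any(marker in t.lower() for marker in REFUSAL_MARKERS):
            refused += 1                   # the model declined the request
        else:
            scores.append(toxicity_fn(t))  # score what the model actually said
    n = len(transcripts)
    refusal_rate = refused / n if n else 0.0
    mean_toxicity = sum(scores) / len(scores) if scores else 0.0
    return refusal_rate, mean_toxicity

if __name__ == "__main__":
    demo = ["I'm sorry, I can't help with that.", "Here is the requested audio."]
    # Toy scorer returning 0.0; swap in a real classifier's probability output.
    print(evaluate(demo, toxicity_fn=lambda t: 0.0))
```

An attack "succeeds" in this framing when it both lowers the refusal rate and raises mean toxicity relative to the unmodified harmful prompt.
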
💡 Why This Paper Matters

This paper addresses the emerging threat landscape around large text-to-speech models, specifically their misuse to generate harmful content. The attacks it presents give a concrete picture of how such models can be abused, with significant implications for the security and moderation of AI-generated audio. It underscores the urgency of stronger safety protocols in TTS systems and makes a strong case for further research and policy consideration in AI governance.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper interesting for its thorough investigation of TTS vulnerabilities, its novel attack methodology spanning both the text and audio modalities, and its emphasis on the need for better defense mechanisms. Beyond deepening the understanding of content misuse in AI-generated audio, it offers actionable insights for designing safeguards and moderation strategies. As TTS and audio-language models grow more capable, understanding and preempting potential abuses becomes critical to maintaining the integrity and safety of audio content generation.

📚 Read the Full Paper: https://arxiv.org/abs/2511.10913v1