
ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs

Authors: Yibo Zhang, Liang Lin

Published: 2025-09-14

arXiv ID: 2509.11128v1

Added to Library: 2025-09-16 04:00 UTC

📄 Abstract

The widespread application of Large Speech Models (LSMs) has made their security risks increasingly prominent. Traditional speech adversarial attack methods face challenges in balancing effectiveness and stealth. This paper proposes Evolutionary Noise Jailbreak (ENJ), which utilizes a genetic algorithm to transform environmental noise from a passive interference into an actively optimizable attack carrier for jailbreaking LSMs. Through operations such as population initialization, crossover fusion, and probabilistic mutation, this method iteratively evolves a series of audio samples that fuse malicious instructions with background noise. These samples sound like harmless noise to humans but can induce the model to parse and execute harmful commands. Extensive experiments on multiple mainstream speech models show that ENJ's attack effectiveness is significantly superior to existing baseline methods. This research reveals the dual role of noise in speech security and provides new critical insights for model security defense in complex acoustic environments.
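
To make the evolutionary procedure concrete, below is a minimal sketch of how such a noise-optimization loop could be organized. It assumes a raw-waveform representation (NumPy arrays), a caller-supplied score_fn that rates how strongly the target speech model complies with the embedded instruction, and simple splice-based crossover plus additive-noise mutation; these design choices and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def initialize_population(instruction, noise_clips, size, rng):
    """Seed candidates by mixing the spoken instruction with sampled background noise."""
    pop = []
    for _ in range(size):
        # Assumes each noise clip is at least as long as the instruction audio.
        noise = noise_clips[rng.integers(len(noise_clips))][: len(instruction)]
        alpha = rng.uniform(0.1, 0.5)             # instruction-to-noise mixing weight
        pop.append(alpha * instruction + (1.0 - alpha) * noise)
    return pop

def crossover(a, b, rng):
    """Crossover fusion: splice two parent waveforms at a random time index."""
    cut = int(rng.integers(1, len(a)))
    return np.concatenate([a[:cut], b[cut:]])

def mutate(x, rng, rate=0.05, scale=0.01):
    """Probabilistic mutation: perturb a random subset of samples with small Gaussian noise."""
    mask = rng.random(len(x)) < rate
    return x + mask * rng.normal(0.0, scale, size=len(x))

def evolve(instruction, noise_clips, score_fn, generations=50, pop_size=20, seed=0):
    """Evolve noise-like audio that maximizes score_fn (attack success on the target model)."""
    rng = np.random.default_rng(seed)
    pop = initialize_population(instruction, noise_clips, pop_size, rng)
    for _ in range(generations):
        pop.sort(key=score_fn, reverse=True)      # rank candidates by attack success
        parents = pop[: pop_size // 2]            # keep the fittest half
        children = [
            mutate(crossover(parents[rng.integers(len(parents))],
                             parents[rng.integers(len(parents))], rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=score_fn)
```

A real run would also need selection pressure for stealth, for example a penalty on deviation from plain background noise, so that the fittest candidates keep sounding harmless to human listeners.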

🔍 Key Points

  • ENJ (Evolutionary Noise Jailbreak) turns environmental noise from passive interference into an actively optimizable attack carrier, using a genetic algorithm to jailbreak Large Speech Models (LSMs).
  • The attack iteratively evolves audio candidates through population initialization, crossover fusion, and probabilistic mutation, fusing malicious instructions with background noise.
  • Evolved samples sound like harmless noise to human listeners yet induce the target model to parse and execute the embedded harmful commands, balancing effectiveness with stealth (a hypothetical scoring hook for this selection criterion is sketched after this list).
  • Experiments on multiple mainstream speech models show attack effectiveness significantly superior to existing baseline methods.
  • The results highlight the dual role of noise in speech security and inform defenses for models deployed in complex acoustic environments.
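
As referenced in the list above, here is a hypothetical scoring hook that could drive the selection step of the loop sketched under the abstract. Neither query_speech_model nor contains_harmful_compliance comes from the paper or any real library; they stand in for querying the target LSM with a candidate waveform and judging whether its response carries out the embedded instruction.

```python
# Hypothetical scoring hook for the evolutionary loop above. Both helpers are
# placeholders: query_speech_model would submit a waveform to the target LSM
# and return its text response, and contains_harmful_compliance would judge
# whether that response carries out the embedded instruction.
def make_score_fn(query_speech_model, contains_harmful_compliance):
    def score_fn(waveform):
        response = query_speech_model(waveform)
        # Reward compliance with the hidden instruction; a stealth penalty
        # (e.g., deviation from plain background noise) could be added so
        # winning candidates keep sounding harmless to human listeners.
        return 1.0 if contains_harmful_compliance(response) else 0.0
    return score_fn
```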

💡 Why This Paper Matters

This paper matters because it shows that environmental noise, normally treated as passive interference, can be actively optimized into an attack carrier against Large Speech Models. Evolved noise-like audio can smuggle harmful instructions past human listeners while still being parsed and executed by the model, exposing a gap between what sounds benign and what the model actually processes. This dual role of noise motivates defenses that account for adversarially shaped acoustics rather than only clean, intelligible speech.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, ENJ is notable because the attack relies on evolutionary search (population initialization, crossover fusion, probabilistic mutation) rather than gradient access, so it applies wherever candidate audio can be submitted to a model and its responses scored. Its reported advantage over existing baselines across multiple mainstream speech models suggests that current safety mechanisms do not generalize to noise-fused inputs, and the paper's findings point to concrete directions for detection and robustness work on LSMs deployed in complex acoustic environments.

📚 Read the Full Paper