← Back to Library

When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity

Authors: Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Minlie Huang

Published: 2025-09-14

arXiv ID: 2509.11141v1

Added to Library: 2025-09-16 04:00 UTC

Red Teaming

📄 Abstract

Emojis are globally used non-verbal cues in digital communication, and extensive research has examined how large language models (LLMs) understand and utilize emojis across contexts. While usually associated with friendliness or playfulness, it is observed that emojis may trigger toxic content generation in LLMs. Motivated by such a observation, we aim to investigate: (1) whether emojis can clearly enhance the toxicity generation in LLMs and (2) how to interpret this phenomenon. We begin with a comprehensive exploration of emoji-triggered LLM toxicity generation by automating the construction of prompts with emojis to subtly express toxic intent. Experiments across 5 mainstream languages on 7 famous LLMs along with jailbreak tasks demonstrate that prompts with emojis could easily induce toxicity generation. To understand this phenomenon, we conduct model-level interpretations spanning semantic cognition, sequence generation and tokenization, suggesting that emojis can act as a heterogeneous semantic channel to bypass the safety mechanisms. To pursue deeper insights, we further probe the pre-training corpus and uncover potential correlation between the emoji-related data polution with the toxicity generation behaviors. Supplementary materials provide our implementation code and data. (Warning: This paper contains potentially sensitive contents)

🔍 Key Points

  • Emojis can significantly enhance the probability of toxicity generation in LLMs by acting as sensitive word substitutes, leading to higher harmfulness compared to plain-text prompts.
  • A robust experimental framework evaluated toxicity generation across various mainstream languages and popular LLMs, demonstrating the cross-linguistic transferability of emoji-induced toxicity.
  • The research identifies that emojis bypass safety mechanisms in LLMs through tokenization disparities, creating a heterogeneous semantic expression channel that diminishes sensitivity to harmful prompts.
  • Analysis of pre-training corpora reveals the presence of emojis within toxic contexts, indicating a link between data pollution and LLMs' ability to generate toxic content when prompted with emojis.
  • A comprehensive understanding of how emojis impact LLMs' response generation was achieved through model-level interpretation, spanning cognitive processing and tokenization discrepancies.

💡 Why This Paper Matters

This paper is crucial because it uncovers the dark potential of emojis in digital communication through LLMs, highlighting significant vulnerabilities in AI systems. The findings stress the need for improved safety measures that address not only verbatim toxicity but also subtle forms of dangerous content stemming from widely accepted non-verbal cues like emojis. This marks an important step in reinforcing the integrity of AI-generated outputs across diverse communication contexts.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper presents vital insights into the vulnerabilities of LLMs regarding emoji use, a seemingly benign aspect of communication that can lead to harmful outputs. Understanding how emojis can leverage existing weaknesses in AI models can inform the development of more robust safety alignment mechanisms, enhance content moderation strategies, and ultimately mitigate risks of AI systems being manipulated for malicious purposes.

📚 Read the Full Paper