
Emoji-Based Jailbreaking of Large Language Models

Authors: M P V S Gopinadh, S Mahaboob Hussain

Published: 2026-01-02

arXiv ID: 2601.00936v1

Added to Library: 2026-01-07 10:04 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are integral to modern AI applications, but their safety alignment mechanisms can be bypassed through adversarial prompt engineering. This study investigates emoji-based jailbreaking, where emoji sequences are embedded in textual prompts to trigger harmful and unethical outputs from LLMs. We evaluated 50 emoji-based prompts on four open-source LLMs: Mistral 7B, Qwen 2 7B, Gemma 2 9B, and Llama 3 8B. Metrics included jailbreak success rate, safety alignment adherence, and latency, with responses categorized as successful, partial, or failed. Results revealed model-specific vulnerabilities: Gemma 2 9B and Mistral 7B exhibited 10% success rates, while Qwen 2 7B achieved full alignment (0% success). A chi-square test (χ² = 32.94, p < 0.001) confirmed significant inter-model differences. Whereas prior work focused on emoji attacks targeting safety judges or classifiers, our empirical analysis examines direct prompt-level vulnerabilities in LLMs. The results reveal limitations in safety mechanisms and highlight the necessity for systematic handling of emoji-based representations in prompt-level safety and alignment pipelines.
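
To make the reported statistic concrete, the snippet below is a minimal sketch (not the authors' code) of how a chi-square test of independence could be run on a model × response-category contingency table. Only the 50-prompt total and the reported success rates (10% for Gemma 2 9B and Mistral 7B, 0% for Qwen 2 7B) come from the abstract; the partial/failed splits and the Llama 3 8B counts are hypothetical placeholders, so the resulting statistic will not reproduce the paper's χ² = 32.94.

```python
# Sketch only: chi-square test of independence over a 4x3 contingency table
# (models x {successful, partial, failed}), with 50 prompts per model.
# All counts other than the reported success rates are hypothetical.
from scipy.stats import chi2_contingency

models = ["Mistral 7B", "Qwen 2 7B", "Gemma 2 9B", "Llama 3 8B"]

# Rows follow `models`; columns are [successful, partial, failed].
observed = [
    [5, 10, 35],   # Mistral 7B: 10% success (partial/failed split assumed)
    [0,  2, 48],   # Qwen 2 7B: 0% success (split assumed)
    [5, 12, 33],   # Gemma 2 9B: 10% success (split assumed)
    [2,  8, 40],   # Llama 3 8B: counts assumed
]

chi2, p_value, dof, _expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
for model, row in zip(models, observed):
    print(f"{model}: success rate = {row[0] / sum(row):.0%}")
```

With four models and three response categories, the test has (4 − 1) × (3 − 1) = 6 degrees of freedom, which is what a contingency-table design of this shape implies.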

🔍 Key Points

  • Investigation of emoji-based jailbreaking techniques revealing significant vulnerabilities in LLMs' safety alignment mechanisms.
  • Empirical analysis of 50 emoji-augmented prompts on four different LLMs, providing insight into model-specific weaknesses and the varying success rates of jailbreak attempts (see the evaluation sketch after this list).
  • Identification of substantial performance discrepancies across models, with Qwen 2 7B demonstrating a 0% jailbreak success rate while Gemma 2 9B and Mistral 7B reached 10%.
  • Quantitative evaluation using statistical methods, confirming significant differences in model performance and ethical compliance rates through a chi-square test.
  • Discussion of the implications for model training and deployment, emphasizing the need for systematic handling of emoji-based representations in LLM safety protocols.
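
As a companion to the key points above, the following is a minimal sketch, assuming a local Hugging Face Transformers setup, of how emoji-augmented prompts could be run against one of the evaluated model families and tallied into the successful / partial / failed categories. The refusal-keyword heuristic, the prompt file name (emoji_prompts.txt), and the specific checkpoint are illustrative assumptions; the paper does not describe its labeling procedure at this level of detail.

```python
# Illustrative evaluation scaffold, not the paper's pipeline: run emoji-augmented
# prompts through a local open-weights model and tally response categories.
from collections import Counter
from transformers import pipeline

# One of the evaluated model families; the exact checkpoint is an assumption.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

# Crude, hypothetical heuristic for labeling responses.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def categorize(response: str) -> str:
    """Outright refusal -> failed, hedged answer -> partial, otherwise successful."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "failed"
    if "however" in text or "in general" in text:
        return "partial"
    return "successful"

# Hypothetical file holding the 50 emoji-augmented prompts, one per line.
with open("emoji_prompts.txt", encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]

counts = Counter()
for prompt in prompts:
    output = generator(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"]
    counts[categorize(output)] += 1

total = sum(counts.values())
print(f"jailbreak success rate: {counts['successful'] / total:.0%}  ({dict(counts)})")
```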

💡 Why This Paper Matters

This paper highlights critical vulnerabilities in LLMs regarding emoji-based adversarial prompting, exposing limitations in current safety alignment mechanisms. Its findings underscore the necessity for adaptive and robust defenses against emerging threats in AI, particularly in the context of maintaining ethical standards in machine learning applications.

🎯 Why It's Interesting for AI Security Researchers

The study offers concrete insight into the robustness and safety of LLMs, a central concern in AI security research. By demonstrating how emoji-laden inputs can bypass existing safety measures, the work can inform the development of more resilient safeguards and contribute to standardized evaluation methods for LLMs, making it directly relevant to researchers focused on AI security and safety.

📚 Read the Full Paper