
Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?

Authors: Berk Atil, Rebecca J. Passonneau, Fred Morstatter

Published: 2025-11-01

arXiv ID: 2511.00689v2

Added to Library: 2025-11-05 05:00 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) undergo safety alignment after training and tuning, yet recent work shows that safety can be bypassed through jailbreak attacks. While many jailbreaks and defenses exist, their cross-lingual generalization remains underexplored. This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages -- spanning high-, medium-, and low-resource languages -- using six LLMs on HarmBench and AdvBench. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. For both types, attack success and defense robustness vary across languages: high-resource languages are safer under standard queries but more vulnerable to adversarial ones. Simple defenses can be effective, but are language- and model-dependent. These findings call for language-aware and cross-lingual safety benchmarks for LLMs.
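
A minimal sketch of what such a cross-lingual evaluation loop might look like. This is illustrative only: the language list, translation step, attack wrappers, model call, and safety judge below are placeholder assumptions, not the authors' actual pipeline.

```python
# Illustrative cross-lingual jailbreak evaluation loop.
# translate(), apply_attack(), query_model(), and judge_unsafe() are
# stand-ins (NOT the paper's implementation); swap in real components.

from dataclasses import dataclass

LANGUAGES = ["en", "de", "zh", "ar", "sw"]  # example high- to low-resource subset

@dataclass
class EvalRecord:
    language: str
    attack: str          # "logical_expression" or "adversarial_prompt"
    prompt: str
    response: str
    unsafe: bool

def translate(query: str, target_lang: str) -> str:
    """Placeholder: translate a harmful query into the target language."""
    return f"[{target_lang}] {query}"

def apply_attack(query: str, attack: str) -> str:
    """Placeholder: wrap the query with a jailbreak template of the given type."""
    return f"<{attack}> {query}"

def query_model(prompt: str) -> str:
    """Placeholder: call the LLM under evaluation."""
    return "MODEL RESPONSE"

def judge_unsafe(response: str) -> bool:
    """Placeholder: harmfulness judge applied to the model's response."""
    return False

def evaluate(harmful_queries: list[str]) -> list[EvalRecord]:
    records = []
    for lang in LANGUAGES:
        for attack in ("logical_expression", "adversarial_prompt"):
            for query in harmful_queries:
                prompt = apply_attack(translate(query, lang), attack)
                response = query_model(prompt)
                records.append(
                    EvalRecord(lang, attack, prompt, response, judge_unsafe(response))
                )
    return records

if __name__ == "__main__":
    results = evaluate(["<harmful query from HarmBench/AdvBench>"])
    for lang in LANGUAGES:
        per_lang = [r for r in results if r.language == lang]
        asr = sum(r.unsafe for r in per_lang) / len(per_lang)
        print(f"{lang}: attack success rate = {asr:.2f}")
```

Attack success rate per language and per attack type is the kind of aggregate the paper compares across resource levels; the loop above just makes that bookkeeping explicit.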

🔍 Key Points

  • This paper presents the first systematic multilingual evaluation of jailbreak attacks and defenses in large language models (LLMs) across ten languages, including high-, medium-, and low-resource languages.
  • The study finds that the effectiveness of jailbreak attacks and defenses varies significantly across languages: high-resource languages yield safer responses to standard harmful queries, yet are more vulnerable to adversarial ones.
  • Two types of jailbreak techniques were evaluated, logical-expression-based and adversarial-prompt-based, and the success of each depends on both the language and the model.
  • Simple defenses, such as self-verification prompting and multilingual classifiers, can effectively detect unsafe responses, though their performance depends on the language and model configuration (see the sketch after this list).
  • The findings highlight the need for developing language-aware safety benchmarks and alignment methods for LLMs to ensure equitable safeguards across diverse linguistic contexts.
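
A minimal sketch of the self-verification style of defense mentioned above, under the assumption that it means asking the model to judge its own response and withholding responses it flags as unsafe. The prompt wording and the `query_model` stub are illustrative, not the paper's exact setup.

```python
# Illustrative self-verification defense: the model answers, then is asked
# a second time whether its own answer is unsafe; flagged answers are
# replaced with a refusal. Prompt wording and query_model are assumptions.

def query_model(prompt: str) -> str:
    """Placeholder: call the LLM being defended."""
    return "NO"

VERIFICATION_TEMPLATE = (
    "Does the following response contain harmful, dangerous, or unethical "
    "content? Answer only YES or NO.\n\nResponse:\n{response}"
)

def guarded_answer(user_prompt: str) -> str:
    response = query_model(user_prompt)
    verdict = query_model(VERIFICATION_TEMPLATE.format(response=response))
    if verdict.strip().upper().startswith("YES"):
        return "I can't help with that."
    return response
```

Because the verification step is itself a prompt to the model, its reliability inherits the same language dependence the paper reports, which is why such defenses work unevenly across languages and models.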

💡 Why This Paper Matters

This paper matters for AI safety because it systematically examines LLM vulnerabilities across multiple languages, showing how attack success and defense efficacy vary with linguistic resource level. Understanding these variations can inform safety measures that cover diverse languages and improve the overall reliability of AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it expands the body of knowledge on cross-lingual vulnerabilities of LLMs, particularly jailbreak attacks and defenses. By identifying which languages are more susceptible to adversarial exploitation and which simple defenses hold up, it offers practical guidance for hardening AI models and points toward future work on language-aware safety protocols.

📚 Read the Full Paper