← Back to Library

MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

Authors: Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues GalvΓ£o Filho

Published: 2026-01-07

arXiv ID: 2601.04389v1

Added to Library: 2026-01-09 03:02 UTC

Red Teaming

πŸ“„ Abstract

Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33\% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not create principle of non-discrimination but reinforces memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts to encourage research into granular demographic alignment at GitHub.

πŸ” Key Points

  • Introduction of MiJaBench, a bilingual adversarial benchmark exposing biases in LLMs with 44,000 prompts across 16 minority groups.
  • Finding that defense rates for hate speech vary significantly up to 33% depending on demographic identity, revealing a demographic hierarchy in model safety alignment.
  • Model scaling exacerbates disparities in safety alignment, indicating that larger models may increase vulnerability for marginalized groups while appearing to improve average safety metrics.
  • Cross-linguistic analysis shows that these biases are not confined to English but persist in Portuguese, confirming deeper structural issues in model safety across languages.
  • Development of an LLM-as-Judge protocol, which provides a systematic approach for evaluating the safety responses of LLMs to adversarial prompts.

πŸ’‘ Why This Paper Matters

The findings of this paper highlight significant flaws in current safety alignment mechanisms in large language models, calling for a reevaluation of how these systems are trained and audited. By exposing a systematic bias that favors certain demographic groups while leaving others vulnerable to hate speech, this research contributes vital insights for the development of fairer AI systems. Moreover, the release of the MiJaBench dataset allows for further investigation into these issues, promoting accountability and improvement in the future of language models.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers because it uncovers critical biases within large language models that could be exploited for malicious purposes. Understanding how demographic factors influence the safety of AI responses is essential for developing more robust and equitable AI systems. Furthermore, the introduction of the MiJaBench benchmark allows researchers to systematically study and address these vulnerabilities, making it a valuable tool for enhancing AI safety protocols.

πŸ“š Read the Full Paper