IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Authors: Priyaranjan Pattnayak, Sanchari Chowdhuri

Published: 2026-03-18

arXiv ID: 2603.17915v1

Added to Library: 2026-03-19 03:01 UTC

Safety

📄 Abstract

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of each prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE-rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts and over-flag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.

🔍 Key Points

  • Introduction of IndicSafe, the first benchmark for evaluating multilingual LLM safety in 12 Indic languages, revealing significant safety drift and inconsistency across models and languages.
  • Created a dataset of 6,000 culturally grounded prompts addressing complex social issues in South Asia that reflect sociocultural dynamics underrepresented in LLM evaluations.
  • Analysis showed cross-language agreement was only 12.8%, indicating significant variability in safety judgments among language-specific versions of the same prompt.
  • Developed new quantitative metrics, such as cross-language consistency and category bias scores, to comprehensively assess safety alignment across different Indic languages and LLMs.
  • Found high rates of ambiguity and refusal bias in low-resource languages, raising concern over the reliability of LLM outputs in multilingual contexts.
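The headline metrics above can be made concrete with a small sketch. The functions, label scheme, and toy data below are illustrative assumptions, not the paper's actual implementation or data: cross-language agreement is taken as the fraction of prompts whose safety label is identical across all language variants, and the SAFE rate is the per-language fraction of prompts judged SAFE.

```python
# Hedged sketch of two metrics described in the paper's abstract:
# cross-language agreement and per-language SAFE rates.
# Label scheme ("SAFE"/"UNSAFE") and data are hypothetical.

def cross_language_agreement(labels_by_prompt):
    """Fraction of prompts whose label is identical across all languages.

    labels_by_prompt: list of dicts mapping language code -> label.
    """
    agree = sum(1 for langs in labels_by_prompt if len(set(langs.values())) == 1)
    return agree / len(labels_by_prompt)

def safe_rates(labels_by_prompt):
    """Per-language fraction of prompts labeled SAFE."""
    langs = labels_by_prompt[0].keys()
    n = len(labels_by_prompt)
    return {lang: sum(p[lang] == "SAFE" for p in labels_by_prompt) / n
            for lang in langs}

# Toy example: 4 prompts judged in 3 languages (labels are made up).
data = [
    {"hi": "SAFE",   "bn": "SAFE",   "ta": "SAFE"},
    {"hi": "SAFE",   "bn": "UNSAFE", "ta": "SAFE"},
    {"hi": "UNSAFE", "bn": "UNSAFE", "ta": "UNSAFE"},
    {"hi": "SAFE",   "bn": "SAFE",   "ta": "UNSAFE"},
]

agreement = cross_language_agreement(data)  # prompts 1 and 3 agree -> 0.5
rates = safe_rates(data)                    # hi: 0.75, bn: 0.5, ta: 0.5
```

A large spread in `rates` across languages for the same underlying prompts is exactly the "safety drift" the paper reports; the benchmark's actual indices (entropy, category bias) are richer than this minimal version.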

💡 Why This Paper Matters

The IndicSafe benchmark represents a substantial advance in evaluating the safety of large language models in multilingual and low-resource contexts, particularly within South Asia. By highlighting discrepancies in model behavior across languages and sensitivity to cultural nuance, the paper underscores the need for rigorous safety assessments that account for varying sociocultural dynamics. As LLMs are integrated into applications across diverse linguistic landscapes, reliable and safe outputs are a precondition for ethical deployment.

🎯 Why It's Interesting for AI Security Researchers

This paper is vital for AI security researchers as it sheds light on the critical issue of safety in multilingual AI systems, particularly within underrepresented and culturally diverse contexts. The findings indicate that existing models may not perform equally well across languages, leading to potential biases and harmful outputs. Understanding these dynamics is essential for developing more robust, culturally aware AI systems that can mitigate risks and enhance safety in their deployment.
