IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

📄 Abstract

Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied. We introduce \textbf{Indic Jailbreak Robustness (IJR)}, a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 Billion speakers), covering 45216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach 1.0 with refusals collapsing. (2) English to Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approx 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.

🔍 Key Points

Introduces the Indic Jailbreak Robustness (IJR) benchmark, a judge-free evaluation of adversarial safety for 12 South Asian languages, expanding the scope of safety assessments to multilingual contexts.
Demonstrates that contract-bound evaluations overestimate safety, revealing a 'contract gap' where models behave differently under forced compliance versus natural language settings.
Highlights the phenomenon of cross-lingual transfer vulnerability, showing that English adversarial prompts can effectively jailbreak models in Indic languages, with stronger transfer from format wrappers compared to instruction wrappers.
Identifies the impact of orthographic variation (native vs. romanized forms) on jailbreak success rates, indicating that mixed or romanized inputs can lower model efficacy and increase vulnerabilities.
Validates the findings through extensive human audits and reproducibility tests, confirming the robustness and reliability of the proposed benchmark methodology.

💡 Why This Paper Matters

The paper contributes significantly to the field of AI safety by addressing the previously overlooked vulnerabilities in South Asian languages and emphasizing the need for multilingual evaluation benchmarks. The findings challenge traditional English-centric assessments, making IJR an essential resource for understanding and mitigating risks in diverse language contexts.

🎯 Why It's Interesting for AI Security Researchers

This research is crucial for AI security researchers as it provides new insights into the vulnerabilities of language models when faced with multilingual and culturally specific adversarial attacks. By identifying weaknesses in existing models through a comprehensive and reproducible framework, it lays the groundwork for improved safety mechanisms and highlights the need for inclusive evaluation strategies that account for linguistic diversity.

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper