← Back to Library

Improving Methodologies for LLM Evaluations Across Global Languages

Authors: Akriti Vij, Benjamin Chua, Darshini Ramiah, En Qi Ng, Mahran Morsidi, Naga Nikshith Gangarapu, Sharmini Johnson, Vanessa Wilfred, Vikneswaran Kumaran, Wan Sie Lee, Wenzhuo Yang, Yongsen Zheng, Bill Black, Boming Xia, Frank Sun, Hao Zhang, Qinghua Lu, Suyu Ma, Yue Liu, Chi-kiu Lo, Fatemeh Azadi, Isar Nejadgholi, Sowmya Vajjala, Agnes Delaborde, Nicolas Rolin, Tom Seimandi, Akiko Murakami, Haruto Ishi, Satoshi Sekine, Takayuki Semitsu, Tasuku Sasaki, Angela Kinuthia, Jean Wangari, Michael Michie, Stephanie Kasaon, Hankyul Baek, Jaewon Noh, Kihyuk Nam, Sang Seo, Sungpil Shin, Taewhi Lee, Yongsu Kim, Daisy Newbold-Harrop, Jessica Wang, Mahmoud Ghanem, Vy Hong

Published: 2026-01-22

arXiv ID: 2601.15706v1

Added to Library: 2026-01-23 03:00 UTC

📄 Abstract

As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high and low resourced groups: Cantonese English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages. These include differences in safeguard robustness across languages and harm types and variation in evaluator reliability (LLM-as-judge vs. human review). Further, it also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.

🔍 Key Points

  • Introduction of the Global Optimization for Safety Vector Extraction (GOSV) framework to identify safety-critical attention heads in LLMs, addressing limitations of local attribution methods.
  • Discovery of two sets of safety vectors—Malicious Injection Vectors and Safety Suppression Vectors—indicating that aligned LLMs have distinct pathways for safety mechanisms.
  • Identification of a threshold where complete safety breakdown occurs when approximately 30% of total attention heads are manipulated, providing insights into the structural vulnerabilities of LLMs.
  • Development of a novel inference-time white-box jailbreak method that significantly outperforms existing methods, demonstrating the practical implications of the findings on LLM safety.
  • Systematic empirical evaluation across multiple models showing consistent patterns and effectiveness of the GOSV framework in enhancing LLM safety interpretability.

💡 Why This Paper Matters

This paper provides crucial insights into the vulnerabilities and safety mechanisms of Large Language Models (LLMs) by introducing a novel framework for identifying critical safety components. The findings highlight the importance of understanding the interdependencies within LLM architectures, which is essential for developing more robust safety measures. By revealing that safety vulnerabilities can be exploited even when traditional alignment strategies are employed, the research underscores the need for more sophisticated safety mechanisms in AI systems. The effective jailbreak method developed offers a practical demonstration of these vulnerabilities, pushing the pursuit of safer and more interpretable AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it addresses pressing concerns about the safety and integrity of Large Language Models (LLMs). The exploration of systematic vulnerabilities and the development of effective exploitation techniques provide a framework for understanding potential attack vectors on LLMs. Moreover, the paper contributes to ongoing discussions about AI alignment and safety, emphasizing the importance of considering both the structural and operational aspects of LLMs in security research. As models become more integrated into critical applications, understanding these vulnerabilities is vital for crafting effective defenses and improving AI safety measures.

📚 Read the Full Paper