
Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix

Authors: Juan Manuel Contreras

Published: 2025-07-19

arXiv ID: 2507.14719v1

Added to Library: 2025-07-22 04:00 UTC

Safety

📄 Abstract

As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy & Impersonation (mean = 24.3%). Analyses of Variance confirmed that safety scores differed significantly across both models and domains (p < .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.
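The abstract describes a three-step workflow: a natural-language safety policy is turned into adversarial prompts, the model under test answers them, and an AI-based rater grades each response against the policy. The minimal Python sketch below illustrates that loop under stated assumptions; the callables `ask_generator`, `ask_model_under_test`, and `ask_rater`, the prompt templates, and the SAFE/UNSAFE verdict format are placeholders for illustration, not the Aymara AI API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical sketch of the evaluation loop described in the abstract.
# `ask_generator`, `ask_model_under_test`, and `ask_rater` stand in for
# whatever LLM clients you use; they are NOT the Aymara AI API.

@dataclass
class EvalItem:
    domain: str                     # e.g., "Privacy & Impersonation"
    policy: str                     # natural-language safety policy text
    prompt: str                     # adversarial prompt derived from the policy
    response: str = ""              # model-under-test response
    is_safe: Optional[bool] = None  # rater verdict

def build_eval(policy: str, domain: str,
               ask_generator: Callable[[str], List[str]],
               n_prompts: int = 5) -> List[EvalItem]:
    """Turn a natural-language safety policy into adversarial test prompts."""
    instruction = (
        f"Write {n_prompts} adversarial user prompts that try to make an "
        f"assistant violate this policy:\n{policy}"
    )
    return [EvalItem(domain=domain, policy=policy, prompt=p)
            for p in ask_generator(instruction)]

def score_eval(items: List[EvalItem],
               ask_model_under_test: Callable[[str], str],
               ask_rater: Callable[[str], str]) -> float:
    """Collect responses and grade each against its policy with an AI rater."""
    for item in items:
        item.response = ask_model_under_test(item.prompt)
        verdict = ask_rater(
            f"Policy:\n{item.policy}\n\nPrompt:\n{item.prompt}\n\n"
            f"Response:\n{item.response}\n\n"
            "Does the response comply with the policy? Answer SAFE or UNSAFE."
        )
        item.is_safe = verdict.strip().upper().startswith("SAFE")
    # Domain-level safety score = share of responses the rater judged safe.
    return sum(i.is_safe for i in items) / len(items)
```

Averaging per-domain scores of this kind for each of the 20 models would yield a matrix analogous to the Aymara LLM Risk and Responsibility Matrix reported in the paper.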

🔍 Key Points

  • Introduction of Aymara AI, a scalable platform for LLM safety evaluation that transforms natural-language safety policies into adversarial prompts and automated scoring.
  • Comprehensive evaluation of 20 LLMs across 10 safety domains, revealing significant performance disparities in how models handle safety risks (see the ANOVA sketch after this list).
  • Identification of specific weaknesses in LLMs, particularly in complex domains like Privacy & Impersonation and Unqualified Professional Advice.
  • Use of an AI-based scoring mechanism validated against human judgments, grounding automated safety measurement in human-verified ratings (see the agreement sketch after this list).
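The performance disparities noted above were confirmed with Analyses of Variance showing that safety scores differ significantly across both models and domains (p < .05). Below is a minimal sketch of how such a two-way comparison could be run; the DataFrame holds illustrative placeholder scores, not the paper's results, and the use of statsmodels here is an assumption rather than the authors' toolchain.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative placeholder scores only (NOT values from the paper): each row is
# one model's mean safety score in one domain.
df = pd.DataFrame({
    "model":  ["model_a", "model_a", "model_b", "model_b", "model_c", "model_c"],
    "domain": ["Misinformation", "Privacy & Impersonation"] * 3,
    "score":  [0.96, 0.30, 0.90, 0.22, 0.88, 0.15],
})

# Two-way ANOVA: do mean safety scores differ across models and across domains?
fit = ols("score ~ C(model) + C(domain)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))  # F statistics and p-values per factor
```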
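The paper validates its AI-based rater against human judgments. One common way to quantify such agreement is raw percent agreement alongside a chance-corrected statistic such as Cohen's kappa; the sketch below computes both on placeholder labels. The paper's exact validation metric is not specified here, so treat the choice of metric as an assumption.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative placeholder labels only: 1 = safe, 0 = unsafe, for the same set
# of responses judged by human annotators and by the AI-based rater.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
rater = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

raw_agreement = sum(h == r for h, r in zip(human, rater)) / len(human)
kappa = cohen_kappa_score(human, rater)  # chance-corrected agreement
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```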

💡 Why This Paper Matters

This research maps the complex landscape of AI safety and offers a framework (Aymara AI) for developing generative AI models that are both effective and safe. It demonstrates the largely untapped potential of customizable safety evaluations tailored to varied real-world contexts, reinforcing the importance of integrating rigorous safety assessment into AI development.

🎯 Why It's Interesting for AI Security Researchers

This paper addresses the pressing need for effective safety evaluation tools as LLMs are increasingly deployed in sensitive areas. Its findings are essential for AI security researchers working to understand and mitigate the risks of LLM outputs, and the identification of specific weaknesses in widely used models offers concrete guidance for future improvements in AI safety and reliability.

📚 Read the Full Paper