Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Authors: Jia Yi Goh, Shaun Khoo, Nyx Iskandar, Gabriel Chua, Leanne Tan, Jessica Foo

Published: 2025-07-13

arXiv ID: 2507.09820v1

Added to Library: 2025-07-15 04:00 UTC

Safety

📄 Abstract

Most safety testing efforts for large language models (LLMs) today focus on evaluating foundation models. However, there is a growing need to evaluate safety at the application level, as components such as system prompts, retrieval pipelines, and guardrails introduce additional factors that significantly influence the overall safety of LLM applications. In this paper, we introduce a practical framework for evaluating application-level safety in LLM systems, validated through real-world deployment across multiple use cases within our organization. The framework consists of two parts: (1) principles for developing customized safety risk taxonomies, and (2) practices for evaluating safety risks in LLM applications. We illustrate how the proposed framework was applied in our internal pilot, providing a reference point for organizations seeking to scale their safety testing efforts. This work aims to bridge the gap between theoretical concepts in AI safety and the operational realities of safeguarding LLM applications in practice, offering actionable guidance for safe and scalable deployment.

🔍 Key Points

Introduction of a practical framework for evaluating safety risks specific to large language model (LLM) applications, differing from traditional foundation model evaluations.
Development of a customized safety risk taxonomy tailored to the specific contexts and requirements of organizations using LLMs, enhancing the applicability of safety evaluations.
Implementation of automated safety testing practices that treat LLM applications as a black box, improving real-world applicability and facilitating ongoing safety assessments after deployment.
Case studies demonstrating the application of the framework, providing practical insights into the efficacy of various testing methods and the importance of context in safety evaluations.
Identification of various risk categories and subcategories that can guide organizations in their safety assessments, aligning with legal and regulatory frameworks.

💡 Why This Paper Matters

This paper presents critical methodologies for enhancing the safety evaluations of LLM applications through tailored frameworks and practical applications. By addressing the specific complexities of application-level safety risks, it fills a significant gap in existing AI safety literature and practice. The proposed methods provide organizations with actionable guidelines for identifying, assessing, and mitigating safety risks intrinsic to LLM technologies, ultimately promoting safer deployment in real-world environments.

🎯 Why It's Interesting for AI Security Researchers

The paper would interest AI security researchers as it introduces novel methodologies and frameworks specifically aimed at addressing the safety risks associated with LLM applications. Its focus on practical implementation, real-world case studies, and custom risk taxonomies provides significant guidance for conducting comprehensive safety evaluations—a key concern in the responsible deployment of AI technologies. Additionally, the frameworks proposed could inspire further research into automated safety testing and the impact of various system components on the overall safety of AI applications.

Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper