
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Authors: Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren

Published: 2025-09-29

arXiv ID: 2509.24384v1

Added to Library: 2025-09-30 04:00 UTC

Red Teaming

📄 Abstract

The alignment of large language models (LLMs) with human values is critical for their safe deployment, yet jailbreak attacks can subvert this alignment to elicit harmful outputs from LLMs. In recent years, a proliferation of jailbreak attacks has emerged, accompanied by diverse metrics and judges to assess the harmfulness of the LLM outputs. However, the absence of a systematic benchmark to assess the quality and effectiveness of these metrics and judges undermines the credibility of the reported jailbreak effectiveness and other risks. To address this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. Our benchmark includes a high-quality dataset of representative harmful prompts paired with diverse harmful and non-harmful model responses, alongside a flexible scoring mechanism compatible with various metrics and judges. With HarmMetric Eval, our extensive experiments uncover a surprising result: two conventional metrics--METEOR and ROUGE-1--outperform LLM-based judges in evaluating the harmfulness of model responses, challenging prevailing beliefs about LLMs' superiority in this domain. Our dataset is publicly available at https://huggingface.co/datasets/qusgo/HarmMetric_Eval, and the code is available at https://anonymous.4open.science/r/HarmMetric-Eval-4CBE.
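
As context for the abstract's headline result, the sketch below shows how the two conventional metrics in question, ROUGE-1 and METEOR, are typically computed between a reference response and a candidate response. It uses the `rouge-score` and `nltk` Python packages with placeholder strings; it illustrates reference-based scoring in general and is not the paper's evaluation pipeline.

```python
# Minimal sketch of reference-based scoring with ROUGE-1 and METEOR.
# Not the paper's pipeline; requires: pip install rouge-score nltk
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

# METEOR in NLTK needs WordNet data for stem/synonym matching.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference response and a candidate response."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    return scorer.score(reference, candidate)["rouge1"].fmeasure


def meteor(reference: str, candidate: str) -> float:
    """METEOR: unigram matching with stemming/synonym credit and a fragmentation penalty."""
    # Recent NLTK versions expect pre-tokenized input.
    return meteor_score([reference.split()], candidate.split())


# Placeholder strings, not items from the HarmMetric Eval dataset.
reference = "I am sorry, but I cannot help with that request."
candidate = "Sorry, I can't assist with that request."
print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.3f}")
print(f"METEOR:     {meteor(reference, candidate):.3f}")
```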

🔍 Key Points

  • Development of HarmMetric Eval, a benchmark for assessing the effectiveness of harmfulness metrics and judges for large language models (LLMs).
  • Introduction of three core criteria (unsafe, relevant, useful) for evaluating harmfulness in model responses, enhancing the understanding of what constitutes harmful behavior.
  • Surprising finding that two conventional metrics, METEOR and ROUGE-1, outperform LLM-based judges in evaluating harmfulness, challenging the assumption that LLM-based judges are superior to traditional metrics for this task.
  • Creation of a diverse and high-quality dataset with 238 harmful prompts and over 3,300 corresponding responses to facilitate systematic evaluation of harmfulness metrics.
  • Proposal of a flexible scoring mechanism compatible with diverse metrics and judges, enabling standardized comparison of their effectiveness (an illustrative loading-and-scoring sketch follows this list).
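
The sketch below illustrates, under stated assumptions, how one might load the released dataset from Hugging Face and wrap different scorers behind a single interface, in the spirit of the flexible scoring mechanism described above. The split name, the column names, and the `MetricJudge` interface are illustrative guesses, not details taken from the paper; consult the dataset card for the real schema.

```python
# Illustrative only: load the released dataset and define a common scorer
# interface. Requires: pip install datasets rouge-score
from typing import Protocol

from datasets import load_dataset
from rouge_score import rouge_scorer


class MetricJudge(Protocol):
    """Common interface a metric or an LLM judge could implement (illustrative only)."""

    def score(self, prompt: str, response: str) -> float: ...


class Rouge1Judge:
    """Reference-based scorer: similarity of a response to a stored reference answer."""

    def __init__(self, references: dict[str, str]):
        self._refs = references
        self._scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    def score(self, prompt: str, response: str) -> float:
        reference = self._refs.get(prompt, "")
        return self._scorer.score(reference, response)["rouge1"].fmeasure


# The split name "train" is an assumption; check the dataset card for the real one.
ds = load_dataset("qusgo/HarmMetric_Eval", split="train")
print(ds.column_names)  # inspect the actual field names before indexing rows
```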

💡 Why This Paper Matters

The introduction of HarmMetric Eval provides a critical tool for systematically evaluating harmfulness metrics and judges in the context of LLMs. By highlighting discrepancies between conventional metrics and LLM-based evaluations, this paper encourages a reevaluation of current methodologies, advocating for a more rigorous approach to assessing harmfulness in AI outputs. The implications extend beyond academic research, impacting the deployment and refinement of safer AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers as it addresses the urgent issue of LLM vulnerabilities to jailbreak attacks and the challenges of aligning their outputs with human values. By establishing a standard for evaluating harmfulness metrics, it aids researchers in identifying weaknesses in current assessment methodologies, ensuring that future defenses against harmful behaviors are more robust and effective.

📚 Read the Full Paper