← Back to Library

Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Authors: Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan

Published: 2025-08-11

arXiv ID: 2508.08236v1

Added to Library: 2025-08-14 23:11 UTC

Safety

📄 Abstract

Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.

🔍 Key Points

  • Introduction of PsyCrisis-Bench as a novel reference-free evaluation benchmark specifically designed for assessing LLM safety alignment in high-risk mental health dialogues, particularly in the context of Chinese-language interactions.
  • Development of a manually curated dataset derived from real-world mental health dialogues, focusing on self-harm, suicidal ideation, and existential distress, providing a rich basis for evaluating safety alignment.
  • Implementation of an LLM-as-Judge approach combined with expert-defined reasoning chains to enable in-context evaluation of LLM responses, resulting in enhanced explainability and interpretability of evaluations.
  • Utilization of binary point-wise scoring across multiple safety dimensions to ensure judgments are transparent, traceable, and aligned with psychological intervention principles.
  • Demonstration of significant improvements in agreement with human expert evaluations and the production of interpretable rationales in experiments, outperforming previous baselines.

💡 Why This Paper Matters

This paper presents a significant advancement in the field of AI-driven mental health support by establishing a comprehensive approach to evaluate the safety alignment of large language models (LLMs) in high-risk dialogue scenarios. By addressing the ethical and practical challenges inherent in evaluating AI responses to sensitive mental health inquiries, the authors contribute essential tools and methodologies that not only enhance the functionality of LLMs in these contexts but also prioritize user safety and well-being. The establishment of PsyCrisis-Bench and the associated dataset mark an important step towards achieving reliable, trustworthy, and ethically sound AI applications in mental health support.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it delves into the critical intersection of AI safety and ethical considerations within mental health contexts. The research highlights the risks associated with deploying LLMs in sensitive applications, such as the potential for misalignments with safety principles and the necessity for robust evaluation frameworks. By providing novel evaluation methodologies and datasets, the study not only informs the development of safer AI models but also emphasizes the importance of interpretability and transparency in AI decision-making processes, which are essential for ensuring accountability in AI systems used in high-stakes environments like mental health.

📚 Read the Full Paper