EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Authors: Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong

Published: 2025-11-10

arXiv ID: 2511.06890v1

Added to Library: 2025-11-11 05:02 UTC

Safety

📄 Abstract

Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge, as existing benchmarks fail to measure role-playing fidelity or address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox, where mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals. This capacity is strongly negatively correlated with ASR, revealing a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. See https://github.com/YL1N/EduGuardBench for Materials.
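
The abstract's headline safety metrics, Attack Success Rate (ASR) and the three-tier Refusal Quality assessment, can be made concrete with a short, self-contained sketch. This is not the authors' released code; the record format and tier names below are assumptions for illustration only.

```python
# Minimal sketch (assumed record format, not the authors' released code)
# of how ASR and a three-tier Refusal Quality tally could be computed
# from judged adversarial transcripts.
from collections import Counter

# Hypothetical judged outputs: one record per adversarial prompt.
# "outcome" is either "attack_success" or one of three refusal tiers;
# the tier names are illustrative stand-ins, with "educational" playing
# the role of the paper's ideal Educational Refusal.
records = [
    {"prompt_id": 1, "outcome": "attack_success"},
    {"prompt_id": 2, "outcome": "educational"},  # refusal that teaches
    {"prompt_id": 3, "outcome": "direct"},       # flat refusal
    {"prompt_id": 4, "outcome": "evasive"},      # off-topic deflection
]

def attack_success_rate(records):
    """ASR = successful attacks / total adversarial prompts."""
    successes = sum(r["outcome"] == "attack_success" for r in records)
    return successes / len(records)

def refusal_quality(records):
    """Share of each refusal tier among all refusals."""
    tiers = [r["outcome"] for r in records if r["outcome"] != "attack_success"]
    counts = Counter(tiers)
    return {tier: n / len(tiers) for tier, n in counts.items()}

print(f"ASR = {attack_success_rate(records):.2f}")  # 0.25
print(refusal_quality(records))                     # tier shares
```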

🔍 Key Points

  • Introduction of EduGuardBench, a benchmark that assesses the pedagogical fidelity and safety of Large Language Models (LLMs) acting as simulated teachers.
  • The benchmark consists of a dual-component structure: Teaching Harm Scenarios (THS) and Adversarial Safety Scenarios (ASS), assessing both professional fidelity and adversarial vulnerability.
  • Identification of the Educational Transformation Effect: the safest models convert harmful requests into teachable moments, indicating a new dimension of AI safety (a minimal correlation sketch follows this list).
  • Extensive evaluation of 14 leading LLMs reveals that reasoning-oriented models generally exhibit higher fidelity, while incompetence remains a common failure mode across many models.
  • Uncovering a scaling paradox in which mid-sized models can be the most vulnerable, challenging the assumption that safety improves monotonically with model size.
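
To make the reported negative correlation between Educational Refusals and ASR concrete, here is a minimal Pearson-correlation check. The per-model rates are invented for illustration; the paper reports the actual values across its 14 models.

```python
# Illustrative check (toy numbers, not the paper's data) of the reported
# negative correlation between a model's Educational Refusal rate and
# its Attack Success Rate across evaluated models.
from statistics import mean

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# Hypothetical per-model rates: fraction of adversarial prompts met with
# an ideal Educational Refusal, and the same model's ASR.
edu_refusal_rate = [0.62, 0.48, 0.30, 0.15, 0.55]
asr = [0.05, 0.12, 0.28, 0.41, 0.09]

print(f"r = {pearson_r(edu_refusal_rate, asr):.3f}")  # strongly negative
```

A strongly negative r on the real data is what the authors interpret as the Educational Transformation Effect.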

💡 Why This Paper Matters

The introduction of EduGuardBench fills a critical gap in the evaluation of LLMs in educational contexts, addressing both the ethical and professional challenges associated with AI-powered teaching tools. By providing a systematic framework for understanding the safety and effectiveness of these models, it lays the groundwork for reliable pedagogical agents that safeguard student welfare and uphold academic integrity.

🎯 Why It's Interesting for AI Security Researchers

This paper tackles the pressing challenges of adversarial attacks and ethical failures specific to educational technologies. Its findings on model vulnerability, including the scaling paradox, and its evaluation methodology contribute to ongoing discussions about safety mechanisms in AI. Researchers can leverage the insights from EduGuardBench to develop more robust AI systems and to ensure that advances in LLM technology are safely integrated into educational environments.

📚 Read the Full Paper

https://arxiv.org/abs/2511.06890v1