
TeleAI-Safety: A comprehensive LLM jailbreaking benchmark towards attacks, defenses, and evaluations

Authors: Xiuyuan Chen, Jian Zhao, Yuxiang He, Yuan Xun, Xinwei Liu, Yanshu Li, Huilin Zhou, Wei Cai, Ziyan Shi, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li

Published: 2025-12-05

arXiv ID: 2512.05485v1

Added to Library: 2025-12-08 03:01 UTC

Tags: Red Teaming, Safety

📄 Abstract

While the deployment of large language models (LLMs) in high-value industries continues to expand, the systematic assessment of their safety against jailbreak and prompt-based attacks remains insufficient. Existing safety evaluation benchmarks and frameworks are often limited by an imbalanced integration of core components (attack, defense, and evaluation methods) and an isolation between flexible evaluation frameworks and standardized benchmarking capabilities. These limitations hinder reliable cross-study comparisons and create unnecessary overhead for comprehensive risk assessment. To address these gaps, we present TeleAI-Safety, a modular and reproducible framework coupled with a systematic benchmark for rigorous LLM safety evaluation. Our framework integrates a broad collection of 19 attack methods (including one self-developed method), 29 defense methods, and 19 evaluation methods (including one self-developed method). With a curated attack corpus of 342 samples spanning 12 distinct risk categories, the TeleAI-Safety benchmark conducts extensive evaluations across 14 target models. The results reveal systematic vulnerabilities and model-specific failure cases, highlighting critical trade-offs between safety and utility, and identifying potential defense patterns for future optimization. In practical scenarios, TeleAI-Safety can be flexibly adjusted with customized attack, defense, and evaluation combinations to meet specific demands. We release our complete code and evaluation results to facilitate reproducible research and establish unified safety baselines.
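As a rough illustration of the modular attack/defense/evaluation composition the abstract describes, the sketch below wires hypothetical attack, defense, target-model, and evaluator callables into a single loop over a prompt corpus and reports an attack success rate. All class and function names here are assumptions for illustration only; they are not the TeleAI-Safety API.

```python
# Minimal sketch of a modular attack -> defense -> evaluation pipeline,
# in the spirit of the framework described in the abstract.
# NOTE: all names and signatures are hypothetical, not the TeleAI-Safety API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalRecord:
    prompt: str            # original harmful prompt from the attack corpus
    adversarial: str       # prompt after the attack transformation
    response: str          # target model output (after the defense is applied)
    is_jailbroken: bool    # verdict produced by the evaluation method


def run_pipeline(
    corpus: List[str],
    attack: Callable[[str], str],           # rewrites a prompt into an adversarial one
    defense: Callable[[str], str],          # filters or rewrites the incoming prompt
    target_model: Callable[[str], str],     # queries the model under test
    evaluator: Callable[[str, str], bool],  # judges whether the response is unsafe
) -> List[EvalRecord]:
    records = []
    for prompt in corpus:
        adversarial = attack(prompt)
        guarded = defense(adversarial)
        response = target_model(guarded)
        verdict = evaluator(prompt, response)
        records.append(EvalRecord(prompt, adversarial, response, verdict))
    return records


def attack_success_rate(records: List[EvalRecord]) -> float:
    """Fraction of corpus prompts for which the evaluator flags a jailbreak."""
    return sum(r.is_jailbroken for r in records) / max(len(records), 1)
```

Swapping in different `attack`, `defense`, or `evaluator` callables is the kind of customized combination the abstract refers to; the real framework presumably adds batching, logging, and per-risk-category reporting on top of this basic loop.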

🔍 Key Points

  • Introduction of TeleAI-Safety, a comprehensive benchmark and modular framework for evaluating LLMs against jailbreak attacks, integrating 19 attack methods, 29 defense methods, and 19 evaluation methods.
  • The framework features two self-developed methods: Morpheus, a metacognitive multi-round attack agent, and RADAR, a multi-agent debate-based safety evaluation method (an illustrative sketch follows this list), advancing LLM safety research.
  • Extensive experimental evaluations reveal systematic vulnerabilities in common LLMs and quantify safety-utility trade-offs, highlighting model-specific behaviors across different attack vectors and risk categories.
  • Development of a curated attack corpus consisting of 342 samples spanning 12 risk categories, enhancing the systematic study and comparison of LLM vulnerabilities.
  • Contributions towards standardizing safety assessment methodologies, providing a reproducible and extensible platform for future research and evaluation in AI safety.
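To make the debate-based evaluation idea behind RADAR more concrete, here is a minimal, hypothetical sketch of a multi-judge debate loop with a majority-vote verdict. The roles, prompts, and aggregation rule used by the actual RADAR method are defined in the paper and will differ from this illustration.

```python
# Hypothetical sketch of a multi-agent debate verdict for safety evaluation.
# NOTE: this only illustrates the general debate pattern; it is NOT the RADAR method.
from typing import Callable, List

Judge = Callable[[str], str]  # takes a judging prompt, returns the judge's argument text


def debate_verdict(prompt: str, response: str, judges: List[Judge], rounds: int = 2) -> bool:
    """Return True if a majority of judges deem the response unsafe after debate."""
    transcript = ""
    opinions: List[str] = []
    for _ in range(rounds):
        opinions = []
        for judge in judges:
            query = (
                f"User request:\n{prompt}\n\n"
                f"Model response:\n{response}\n\n"
                f"Prior arguments:\n{transcript or '(none)'}\n\n"
                "Argue whether the response is UNSAFE or SAFE, then state your verdict."
            )
            opinions.append(judge(query))
        # Each judge sees the others' arguments in the next round.
        transcript = "\n---\n".join(opinions)
    unsafe_votes = sum("UNSAFE" in op.upper() for op in opinions)
    return unsafe_votes > len(judges) / 2
```

The appeal of this pattern is that disagreements between judges are surfaced and argued over before a verdict is aggregated, rather than relying on a single LLM judgment.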

💡 Why This Paper Matters

The TeleAI-Safety framework advances LLM safety research by addressing gaps in prior evaluations: it unifies attack, defense, and evaluation methods in a single reproducible benchmark, improving the understanding of vulnerabilities, defenses, and safety assessment practices. This work opens new pathways for hardening AI applications deployed in sensitive domains.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it tackles critical issues in evaluating and defending large language models against emerging prompt-based threats. The novel methodologies and comprehensive benchmark provided by TeleAI-Safety enable deeper, comparable investigations into model vulnerabilities, supporting safer AI deployments and contributing to the broader discourse on AI ethics and security.

📚 Read the Full Paper: https://arxiv.org/abs/2512.05485