SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

Authors: Xin Gao, Shaohan Yu, Zerui Chen, Yueming Lyu, Weichen Yu, Guanghao Li, Jiyao Liu, Jianxiong Gao, Jian Liang, Ziwei Liu, Chenyang Si

Published: 2025-11-19

arXiv ID: 2511.15169v2

Added to Library: 2025-11-21 03:05 UTC

📄 Abstract

Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, primarily focus on output-level judgments and rarely capture these dynamic risks along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end -- from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We pioneer the incorporation of risk categories and levels into input design, explicitly accounting for affected groups and severity, and thereby establish a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations on 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.
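The abstract does not describe how micro-thought chunking is implemented. As a rough, hypothetical illustration of the idea (not the authors' method), the sketch below segments a reasoning trace into small chunks by grouping sentences so that each chunk can later be scored on its own; the function and parameter names (chunk_trace, max_sentences) are invented for this example.

```python
import re

def chunk_trace(reasoning_trace: str, max_sentences: int = 3) -> list[str]:
    """Split a long reasoning trace into small 'micro-thought' chunks.

    Hypothetical illustration: a chunk here is simply a run of up to
    `max_sentences` sentences; the paper's actual mechanism produces
    semantically coherent units and may work quite differently.
    """
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", reasoning_trace) if s.strip()]
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunks.append(" ".join(sentences[i:i + max_sentences]))
    return chunks

trace = (
    "First, I consider what the user is really asking. "
    "The request could be read as seeking instructions for harm. "
    "I should refuse the harmful reading. "
    "However, there is also a benign interpretation. "
    "I will answer only the benign interpretation and state the refusal clearly."
)
for idx, chunk in enumerate(chunk_trace(trace)):
    print(f"chunk {idx}: {chunk}")
```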

🔍 Key Points

  • First end-to-end safety benchmark for Large Reasoning Models (LRMs), evaluating inputs, intermediate reasoning traces, and final outputs rather than relying on output-level judgments alone.
  • Input characterization that incorporates risk categories and risk levels, explicitly accounting for affected groups and harm severity, yielding a balanced prompt suite spanning diverse harm gradients.
  • A micro-thought chunking mechanism that segments long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions (a hypothetical sketch of this per-chunk scoring appears after this list).
  • Human safety alignment: LLM-based evaluations are validated against human annotations designed specifically to capture safety judgments.
  • Evaluation of 19 LRMs, showing that SafeRBench supports detailed, multidimensional safety assessment and surfaces both risks and protective mechanisms from multiple perspectives.
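To make the per-chunk evaluation concrete, here is a minimal sketch of how chunk-level scores across ten dimensions might be aggregated into a per-dimension profile and a risk trajectory along the trace. The dimension names are placeholders (the summary does not list them), and the keyword-based judge is only a stand-in for the paper's LLM-based evaluator validated against human annotations.

```python
from statistics import mean

# Placeholder names; SafeRBench defines ten safety dimensions,
# but their exact labels are not given in this summary.
SAFETY_DIMENSIONS = [f"dimension_{i}" for i in range(1, 11)]

def judge_chunk(chunk: str) -> dict[str, float]:
    """Stub judge returning a risk score in [0, 1] per safety dimension.

    In SafeRBench this role is played by an LLM-based evaluator aligned
    with human annotations; the heuristic below only keeps the example
    runnable end to end.
    """
    risky = 1.0 if "instructions for harm" in chunk.lower() else 0.0
    return {dim: risky for dim in SAFETY_DIMENSIONS}

def evaluate_trace(chunks: list[str]) -> dict:
    """Score every chunk, then aggregate into a per-dimension profile
    and a coarse risk trajectory along the reasoning trace."""
    per_chunk = [judge_chunk(c) for c in chunks]
    profile = {dim: mean(scores[dim] for scores in per_chunk) for dim in SAFETY_DIMENSIONS}
    trajectory = [mean(scores.values()) for scores in per_chunk]
    return {"per_dimension": profile, "risk_trajectory": trajectory}

chunks = [
    "The request could be read as seeking instructions for harm.",
    "I will answer only the benign interpretation and refuse the rest.",
]
report = evaluate_trace(chunks)
print(report["risk_trajectory"])
```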

💡 Why This Paper Matters

Large Reasoning Models expose their chain-of-thought, and harmful content can be injected into, emerge gradually within, or be rationalized by that reasoning trace; safety evaluations that judge only final outputs miss these dynamics. By assessing the full path from inputs through intermediate reasoning to outputs, SafeRBench closes this gap and gives model developers a principled, multidimensional way to measure how and where safety breaks down.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of interest to AI security researchers because reasoning traces are an emerging attack surface: risks can surface mid-trace even when the final answer appears benign. SafeRBench's risk-graded prompt suite, micro-thought chunking, human-aligned LLM judging, and results across 19 LRMs provide concrete tooling and baselines for studying how safety failures develop during reasoning and how protective mechanisms intervene.

📚 Read the Full Paper