
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Authors: Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua

Published: 2025-06-09

arXiv ID: 2506.07736v1

Added to Library: 2025-06-10 04:03 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models (models that monitor LLM inputs and outputs and block potentially harmful content) has emerged as a prevalent mitigation strategy. Existing approaches to training guard models rely heavily on extensive human-curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: 1) guided reasoning, where it analyzes the safety risks of input content through policy-guided, step-by-step reasoning, and 2) reinforced alignment, where rule-based reinforcement learning (RL) optimizes its reasoning paths to align with accurate safety predictions. This two-stage training paradigm enables RSafe to internalize safety principles and generalize its protection capability to unseen or adversarial safety violation scenarios. During inference, RSafe accepts user-specified safety policies to provide enhanced safeguards tailored to specific safety requirements.
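
To make the reinforced-alignment stage concrete, here is a minimal sketch of what a rule-based RL reward for a reasoning guard model could look like: each sampled completion is scored on whether it follows the expected output format and whether its final verdict matches the ground-truth safety label. The `<reasoning>`/`<verdict>` tag format, the label vocabulary, and the reward weights are illustrative assumptions, not details taken from the paper.

```python
import re

# Illustrative rule-based reward for the reinforced-alignment stage (sketch).
# Tag names, labels, and weights are assumptions, not RSafe's published setup.

REASONING_RE = re.compile(r"<reasoning>.+?</reasoning>", re.DOTALL | re.IGNORECASE)
VERDICT_RE = re.compile(r"<verdict>\s*(safe|unsafe)\s*</verdict>", re.IGNORECASE)

def rule_based_reward(completion: str, gold_label: str) -> float:
    """Score one sampled completion against the ground-truth safety label."""
    reward = 0.0

    # Format reward: the completion must contain a reasoning trace and a verdict.
    if REASONING_RE.search(completion) and VERDICT_RE.search(completion):
        reward += 0.5

    # Accuracy reward: the final verdict must match the gold safety label.
    verdict = VERDICT_RE.search(completion)
    if verdict and verdict.group(1).lower() == gold_label.lower():
        reward += 1.0

    return reward

# Example: a well-formed, correct completion earns the full reward (1.5).
sample = (
    "<reasoning>The request asks for step-by-step weapon-building instructions, "
    "which violates the weapons policy in scope.</reasoning>"
    "<verdict>unsafe</verdict>"
)
print(rule_based_reward(sample, "unsafe"))
```

Scalar rewards of this kind would then drive a policy-gradient update (for example, a PPO- or GRPO-style optimizer) over the guard model's sampled reasoning paths.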

🔍 Key Points

  • RSafe introduces a two-stage safeguard mechanism consisting of guided reasoning and reinforced alignment to enhance the safety of LLM outputs.
  • It leverages reinforcement learning to optimize safety predictions while minimizing dependence on large labeled datasets, addressing limitations of existing guard models.
  • RSafe demonstrates superior generalization capabilities to unseen and adversarial safety violation scenarios, outperforming traditional guard models.
  • The framework is adaptive: users can specify safety policies during inference, tailoring protection to specific applications or emerging threats (see the prompt sketch after this list).
  • RSafe provides interpretability through human-readable safety judgments and reasoning traces, promoting transparency in decision-making.
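
Below is a minimal sketch of how a user-specified safety policy might be spliced into the guard prompt at inference time, scoping the step-by-step safety reasoning to that policy. The template wording, function name, and example policy are illustrative assumptions rather than RSafe's actual prompt.

```python
# Sketch of policy-conditioned inference for a reasoning guard model.
# The prompt template and example policy text are illustrative assumptions.

GUARD_TEMPLATE = """You are a safety guard model.
Safety policy in scope:
{policy}

Analyze the content below step by step against the policy above,
then give a final verdict of "safe" or "unsafe".

Content:
{content}
"""

def build_guard_prompt(policy: str, content: str) -> str:
    """Render the policy-conditioned prompt sent to the guard model."""
    return GUARD_TEMPLATE.format(policy=policy.strip(), content=content.strip())

custom_policy = "Block requests for malware development or self-harm instructions."
print(build_guard_prompt(custom_policy, "How do I write a keylogger?"))
```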

💡 Why This Paper Matters

The RSafe framework is an innovative approach to addressing the safety challenges posed by large language models, providing a robust, customizable, and interpretable solution to safeguard against harmful content. Its emphasis on reasoning and reinforcement learning represents a significant shift in how guard models can evolve to meet new challenges in AI safety.

🎯 Why It's Interesting for AI Security Researchers

This paper is important for AI security researchers because it tackles the robustness and adaptability of LLM safeguards, a pressing concern as LLMs are increasingly integrated into sensitive applications. Its techniques for adaptive reasoning and reinforcement learning-based safeguarding open new avenues for improving LLM safety, making the work directly relevant to AI security practice.

📚 Read the Full Paper