
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Authors: Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua

Published: 2025-06-09

arXiv ID: 2506.07736v1

Added to Library: 2025-06-10 04:03 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models (models that monitor LLM inputs and outputs and block potentially harmful content) has emerged as a prevalent mitigation strategy. Existing approaches to training guard models rely heavily on extensive human-curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: 1) guided reasoning, where it analyzes the safety risks of input content through policy-guided, step-by-step reasoning, and 2) reinforced alignment, where rule-based reinforcement learning (RL) optimizes its reasoning paths to align with accurate safety predictions. This two-stage training paradigm enables RSafe to internalize safety principles and generalize its protection capability to unseen or adversarial safety violation scenarios. During inference, RSafe accepts user-specified safety policies to provide enhanced safeguards tailored to specific safety requirements.
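
To make the reinforced-alignment stage concrete, here is a minimal sketch of what a rule-based RL reward for a reasoning guard model could look like: each sampled completion is scored on whether it follows the expected output format and whether its final verdict matches the ground-truth safety label. The `<reasoning>`/`<verdict>` tag format, the label vocabulary, and the reward weights are illustrative assumptions, not details taken from the paper.

```python
import re

# Illustrative rule-based reward for the reinforced-alignment stage (sketch).
# Tag names, labels, and weights are assumptions, not RSafe's published setup.

REASONING_RE = re.compile(r"<reasoning>.+?</reasoning>", re.DOTALL | re.IGNORECASE)
VERDICT_RE = re.compile(r"<verdict>\s*(safe|unsafe)\s*</verdict>", re.IGNORECASE)

def rule_based_reward(completion: str, gold_label: str) -> float:
    """Score one sampled completion against the ground-truth safety label."""
    reward = 0.0

    # Format reward: the completion must contain a reasoning trace and a verdict.
    if REASONING_RE.search(completion) and VERDICT_RE.search(completion):
        reward += 0.5

    # Accuracy reward: the final verdict must match the gold safety label.
    verdict = VERDICT_RE.search(completion)
    if verdict and verdict.group(1).lower() == gold_label.lower():
        reward += 1.0

    return reward

# Example: a well-formed, correct completion earns the full reward (1.5).
sample = (
    "<reasoning>The request asks for step-by-step weapon-building instructions, "
    "which violates the weapons policy in scope.</reasoning>"
    "<verdict>unsafe</verdict>"
)
print(rule_based_reward(sample, "unsafe"))
```

Scalar rewards of this kind would then drive a policy-gradient update (for example, a PPO- or GRPO-style optimizer) over the guard model's sampled reasoning paths.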

🔍 Key Points

  • RSafe introduces a two-stage safeguard mechanism consisting of guided reasoning and reinforced alignment to enhance the safety of LLM outputs.
  • It leverages reinforcement learning to optimize safety predictions while minimizing dependence on large labeled datasets, addressing limitations of existing guard models.
  • RSafe demonstrates superior generalization capabilities to unseen and adversarial safety violation scenarios, outperforming traditional guard models.
  • The framework is adaptive: users can specify safety policies during inference, tailoring protection to specific applications or emerging threats (see the prompt sketch after this list).
  • RSafe provides interpretability through human-readable safety judgments and reasoning traces, promoting transparency in decision-making.
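
Below is a minimal sketch of how a user-specified safety policy might be spliced into the guard prompt at inference time, scoping the step-by-step safety reasoning to that policy. The template wording, function name, and example policy are illustrative assumptions rather than RSafe's actual prompt.

```python
# Sketch of policy-conditioned inference for a reasoning guard model.
# The prompt template and example policy text are illustrative assumptions.

GUARD_TEMPLATE = """You are a safety guard model.
Safety policy in scope:
{policy}

Analyze the content below step by step against the policy above,
then give a final verdict of "safe" or "unsafe".

Content:
{content}
"""

def build_guard_prompt(policy: str, content: str) -> str:
    """Render the policy-conditioned prompt sent to the guard model."""
    return GUARD_TEMPLATE.format(policy=policy.strip(), content=content.strip())

custom_policy = "Block requests for malware development or self-harm instructions."
print(build_guard_prompt(custom_policy, "How do I write a keylogger?"))
```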

💡 Why This Paper Matters

The RSafe framework is an innovative approach to addressing the safety challenges posed by large language models, providing a robust, customizable, and interpretable solution to safeguard against harmful content. Its emphasis on reasoning and reinforcement learning represents a significant shift in how guard models can evolve to meet new challenges in AI safety.

🎯 Why It's Interesting for AI Security Researchers

This paper is important for AI security researchers because it tackles the robustness and adaptability of LLM safeguards, a pressing concern as LLMs are increasingly integrated into sensitive applications. Its techniques for adaptive reasoning and reinforcement learning-based safeguarding open new avenues for improving LLM safety, making the work directly relevant to AI security practice.

📚 Read the Full Paper