
GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners

Authors: Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song

Published: 2025-09-29

arXiv ID: 2509.24418v1

Added to Library: 2025-09-30 04:06 UTC

Safety

📄 Abstract

As large language models (LLMs) are increasingly integrated into applications across many domains, their safety becomes a critical concern for both application developers and end users. Great efforts have been made to develop safety benchmarks with fine-grained taxonomies, but these taxonomies are disparate and reflect different safety policies. As a result, existing safeguards trained on these benchmarks are either coarse-grained, distinguishing only between safe and unsafe, or constrained by the narrow risk taxonomy of a single benchmark. To leverage fine-grained safety taxonomies across multiple safety benchmarks, in this paper we propose GSPR, a Generalizable Safety Policy Reasoner that identifies unsafe input prompts and LLM outputs, together with the violated safety categories, and is trained through Group Relative Policy Optimization (GRPO). Unlike prior safeguards that cover only a fixed set of risk factors, GSPR incentivizes its reasoning capability over varied safety taxonomies through a careful cold-start strategy and reward design. Consequently, GSPR can be trained across multiple safety benchmarks with distinct taxonomies and naturally exhibits strong generalization. We conduct extensive experiments showing that GSPR significantly improves on existing safety guardrails' reasoning capabilities for both safety and category prediction tasks. Moreover, GSPR not only demonstrates strong safety generalization but also achieves the lowest inference token cost while still providing explanations.
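The abstract mentions a reward design that drives GRPO training over both the coarse safe/unsafe decision and the fine-grained taxonomy label. The paper's exact reward is not reproduced on this page; the following is a minimal sketch under assumed components (format, safety label, and category correctness), where `SafetySample`, `reward`, `well_formatted`, and the weights are all hypothetical names and values, not GSPR's actual implementation:

```python
# Hedged sketch of a GRPO-style reward for safety and category prediction.
# All names and weights are illustrative assumptions, not the paper's reward.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafetySample:
    prompt: str
    gold_label: str               # "safe" or "unsafe"
    gold_category: Optional[str]  # benchmark-specific taxonomy label, if unsafe


def reward(sample: SafetySample,
           predicted_label: str,
           predicted_category: Optional[str],
           well_formatted: bool) -> float:
    """Combine format, coarse safety-label, and fine-grained category rewards."""
    r = 0.0
    if well_formatted:                        # reasoning and answer sections present
        r += 0.1
    if predicted_label == sample.gold_label:  # safe/unsafe correctness
        r += 1.0
        if sample.gold_label == "unsafe" and predicted_category == sample.gold_category:
            r += 1.0                          # violated-category correctness
    return r
```

Under GRPO, per-completion rewards of this kind would be normalized within each group of sampled responses to the same prompt to form relative advantages; the actual reward components and weights used by GSPR are given in the paper.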

🔍 Key Points

  • Introduction of GSPR, a Generalizable Safety Policy Reasoner that utilizes Group Relative Policy Optimization (GRPO) to improve safety reasoning for large language models (LLMs) across various safety benchmarks.
  • Demonstration of enhanced flexibility in training by incorporating multiple safety taxonomies through a novel cold-start strategy, allowing GSPR to adapt to evolving safety requirements.
  • Significant gains in predicting fine-grained safety categories, with over 45% higher accuracy than existing models.
  • Minimal inference token costs while preserving clear safety reasoning explanations, indicating efficient use of computational resources during moderation tasks (a hedged parsing sketch follows this list).
  • Extensive evaluations showcase GSPR's superior performance both in-domain and out-of-domain, validating its generalization capability across unfamiliar safety policies.
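Since GSPR returns a short explanation alongside its verdict and violated category, a downstream moderation pipeline would typically parse that structured response. As a rough illustration only, such a consumer might look like the sketch below; the "reasoning / verdict / category" layout is an assumption for this example, not GSPR's actual output schema:

```python
# Hypothetical parsing of a structured guard response into a verdict and a
# violated category; the response layout shown is assumed, not from the paper.
import re


def parse_guard_output(text: str) -> dict:
    """Extract a safe/unsafe verdict and an optional category from guard output."""
    verdict = re.search(r"verdict:\s*(safe|unsafe)", text, re.IGNORECASE)
    category = re.search(r"category:\s*(.+)", text, re.IGNORECASE)
    return {
        "verdict": verdict.group(1).lower() if verdict else None,
        "category": category.group(1).strip() if category else None,
    }


example = ("reasoning: the prompt requests step-by-step malware instructions.\n"
           "verdict: unsafe\n"
           "category: cyber harms")
print(parse_guard_output(example))  # {'verdict': 'unsafe', 'category': 'cyber harms'}
```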

💡 Why This Paper Matters

The GSPR framework presents a substantial advancement in the safety alignment of large language models, addressing critical vulnerabilities in existing safety guardrails. By utilizing diverse safety taxonomies and improving interpretability through structured reasoning, this research provides a comprehensive approach to ensuring safety in AI applications. The implementation of efficient training and inference strategies further enhances its practicality, making it a valuable contribution to the field.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant due to its focus on developing robust defenses against the misuse of LLMs, which are prone to safety vulnerabilities. The ability to generalize across different safety policies while providing explainable reasoning mechanisms can aid researchers in building more secure and reliable AI systems. By addressing inadequacies in current safeguard methodologies, this research contributes crucial insights and tools that can shape future strategies for mitigating risks associated with AI language models.

📚 Read the Full Paper