
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Authors: Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, Hannah Marlowe

Published: 2025-11-26

arXiv ID: 2511.21050v1

Added to Library: 2025-11-27 03:01 UTC

Safety

📄 Abstract

Fine-tuning large language models (LLMs) for downstream tasks typically exhibits a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety-capability tradeoff and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
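To make the flavor of the theoretical claim concrete, below is a minimal sketch of the standard Pinsker-style argument by which a KL constraint bounds safety drift. The symbols (reference policy π_ref, penalty β, KL budget ε, bounded safety score s) and the inequalities are generic textbook notation and assumptions, not the paper's own derivation.

```latex
% KL-regularized objective typical of RLVR-style training (sketch):
% maximize expected verifiable reward while penalizing divergence from the reference policy.
\[
J(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[ r(x, y) \bigr]
\;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)
\]

% For a safety score s(x, y) \in [0, 1], the drift in expected safety is controlled by
% total variation distance, which Pinsker's inequality bounds via the KL divergence:
\[
\bigl| \mathbb{E}_{\pi_\theta}[s] - \mathbb{E}_{\pi_{\mathrm{ref}}}[s] \bigr|
\;\le\; \mathrm{TV}\!\left( \pi_\theta, \pi_{\mathrm{ref}} \right)
\;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right)}
\;\le\; \sqrt{\tfrac{\varepsilon}{2}}
\quad \text{whenever } \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \le \varepsilon .
\]
```

In words: if training keeps the fine-tuned policy within a KL budget ε of the safety-aligned reference policy, the expected safety score cannot drop by more than √(ε/2), which is the general mechanism behind bounds of the kind the paper derives.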

🔍 Key Points

  • The paper presents Reinforcement Learning with Verifiable Rewards (RLVR) as a fine-tuning method that improves the task performance of large language models (LLMs) while preserving safety alignment, effectively breaking the traditional safety-capability trade-off (a minimal sketch of a verifiable reward appears after this list).
  • The theoretical analysis derives upper bounds on safety drift under KL-constrained optimization and identifies conditions under which safety degradation is eliminated even as capabilities improve.
  • Extensive empirical experiments across five adversarial benchmarks show that RLVR maintains or improves safety guardrails while also enhancing reasoning capabilities.
  • Comprehensive ablation studies examine the effects of optimization algorithms, model scale (7B vs. 32B parameters), and task domains, showing negligible safety degradation under RLVR, in contrast to supervised fine-tuning (SFT).
  • The findings challenge the prevailing belief that improving LLM task performance inherently compromises safety, suggesting that the right training methodology can achieve both objectives simultaneously.
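As a concrete illustration of the "verifiable reward" referenced above, here is a minimal, hypothetical sketch: a rule-based checker that scores a completion against a known ground-truth answer, the kind of objectively measurable signal RLVR optimizes in place of a learned preference model. The function names, the \boxed{...} answer convention, and the normalization rules are illustrative assumptions, not the paper's implementation.

```python
import re


def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} answer from a completion (illustrative convention)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary, rule-checkable reward: 1.0 iff the extracted answer matches the reference.

    Unlike a learned reward model (as in RLHF), this signal comes from a deterministic
    verifier, so there is no reward model for the policy to exploit or drift against.
    """
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    normalize = lambda s: s.replace(" ", "").lower()
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0


if __name__ == "__main__":
    # Hypothetical completion from a policy being trained with RLVR.
    completion = "Adding the two terms gives \\boxed{42}."
    print(verifiable_reward(completion, "42"))  # 1.0
```

In an RLVR loop, rewards like this would be computed for sampled completions and fed to a KL-regularized policy-gradient update against the reference model, matching the objective sketched after the abstract.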

💡 Why This Paper Matters

The paper marks a significant advance in the fine-tuning of LLMs, proposing RLVR as a training method that improves task performance without sacrificing safety alignment. Beyond broadening the scope of safety-assured model deployment, it prompts a re-evaluation of the assumed safety-capability trade-off and provides a foundation for future work on AI safety mechanisms.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper highly relevant because it addresses a central concern in deploying AI systems: balancing capability gains against safety. The exploration and validation of RLVR could lead to more secure AI applications by ensuring that performance improvements do not erode safety guardrails, contributing to more reliable and accountable AI systems.

📚 Read the Full Paper