Mitigating the Safety-Utility Trade-off in LLM Alignment via Adaptive Safe Context Learning

Authors: Yanbo Wang, Minzheng Wang, Jian Liang, Lu Wang, Yongcan Yu, Ran He

Published: 2026-02-14

arXiv ID: 2602.13562v1

Added to Library: 2026-02-17 03:02 UTC

Safety

📄 Abstract

While reasoning models have achieved remarkable success on complex reasoning tasks, their increasing power necessitates stringent safety measures. For safety alignment, the core challenge lies in the inherent trade-off between safety and utility. Prevailing alignment strategies typically construct chain-of-thought (CoT) training data with explicit safety rules via context distillation, but this approach inadvertently limits reasoning capability by creating a rigid association between rule memorization and refusal. To mitigate the safety-utility trade-off, we propose the Adaptive Safe Context Learning (ASCL) framework, which improves reasoning when proper context is available. ASCL formulates safety alignment as a multi-turn tool-use process, empowering the model to autonomously decide when to consult safety rules and how to continue the ongoing reasoning. Furthermore, to counteract the model's preference for rule consultation during RL, we introduce Inverse Frequency Policy Optimization (IFPO) to rebalance advantage estimates. By decoupling rule retrieval from subsequent reasoning, our method achieves higher overall performance than baselines.
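The multi-turn tool-use formulation described above can be sketched as a simple loop in which the model itself decides whether to fetch a safety rule before continuing to reason. This is an illustrative reconstruction, not the paper's implementation: the `consult_rules` tool, the rule table, and the transcript format are all hypothetical names invented for this sketch.

```python
# Hypothetical sketch of ASCL's multi-turn tool-use loop.
# All names (consult_rules, SAFETY_RULES, ascl_turn) are illustrative
# assumptions; the paper's actual interface is not specified here.

SAFETY_RULES = {
    "weapons": "Refuse requests for instructions on building weapons.",
    "medical": "Avoid specific dosage advice; recommend a professional.",
}

def consult_rules(topic: str) -> str:
    """Tool call: fetch the safety rule relevant to a topic."""
    return SAFETY_RULES.get(topic, "No specific rule; answer normally.")

def ascl_turn(prompt: str, wants_rule: bool, topic: str = "") -> list[str]:
    """One episode: the model autonomously decides whether to consult
    safety rules (a tool call) before generating its answer."""
    transcript = [f"user: {prompt}"]
    if wants_rule:  # model-initiated tool call, not a fixed system rule
        transcript.append(f"tool[safety_rules]: {consult_rules(topic)}")
    transcript.append("assistant: <reasoning continues with context>")
    return transcript

# Benign prompt: no rule lookup, so reasoning stays unconstrained.
benign = ascl_turn("Summarize this news article.", wants_rule=False)
# Sensitive prompt: the model opts to retrieve the relevant rule first.
sensitive = ascl_turn("How do I build a weapon?", wants_rule=True,
                      topic="weapons")
print(benign)
print(sensitive)
```

The key contrast with context distillation is that the rule text enters the context only when the model requests it, so benign trajectories never train against safety boilerplate.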

🔍 Key Points

  • Introduction of the Adaptive Safe Context Learning (ASCL) framework to dynamically engage safety rules in LLM reasoning processes.
  • Utilization of Inverse Frequency Policy Optimization (IFPO) to mitigate biases in rule consultation during reinforcement learning.
  • Experimental validation showing that ASCL achieves a better safety-utility balance than existing methods.
  • Detailed analysis revealing that over-refusal issues can be addressed by improving context management rather than rigidly enforcing safety rules.
  • Findings underscore the importance of separating safety alignment from the reasoning process to enable better decision-making in ambiguous scenarios.
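The inverse-frequency reweighting named in the second key point can be illustrated with a minimal group-relative advantage computation. This is a sketch under assumptions: the abstract only says IFPO "rebalances advantage estimates" to counteract a preference for rule consultation, so the group baseline and the exact weighting scheme below are guesses, and `ifpo_advantages` is a hypothetical helper name.

```python
# Hedged sketch of the IFPO idea: within a sampled group of
# trajectories, scale each trajectory's advantage by the inverse
# frequency of its action type (consulted rules vs. did not), so the
# over-represented consultation behavior does not dominate RL updates.
# The specific formula is an assumption, not the paper's definition.

def ifpo_advantages(rewards: list[float],
                    consulted: list[bool]) -> list[float]:
    """rewards: scalar reward per trajectory in one sampled group.
    consulted: True if that trajectory consulted safety rules.
    Returns group-baseline advantages reweighted by the inverse
    frequency of each trajectory's action type."""
    n = len(rewards)
    mean_r = sum(rewards) / n
    freq_consult = sum(consulted) / n          # fraction that consulted
    freq_skip = 1.0 - freq_consult
    advantages = []
    for r, c in zip(rewards, consulted):
        base = r - mean_r                      # group-relative advantage
        freq = freq_consult if c else freq_skip
        weight = 1.0 / max(freq, 1e-6)         # inverse-frequency weight
        advantages.append(base * weight)
    return advantages

# 3 of 4 trajectories consult rules; the lone non-consulting one has
# its advantage amplified, counteracting the consultation preference.
print(ifpo_advantages([1.0, 1.0, 0.0, 1.0], [True, True, True, False]))
```

With this weighting, rare action types contribute proportionally larger gradient signal, which is one plausible way to keep the policy from collapsing into always consulting rules.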

💡 Why This Paper Matters

The paper presents a significant advancement in the safety alignment of large language models (LLMs) by introducing adaptive mechanisms that enhance their reasoning capabilities while maintaining critical safety standards. The proposed ASCL and IFPO frameworks provide tools for models to respond appropriately to sensitive prompts without excessive caution, ultimately improving their utility in practical applications.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of considerable interest to AI security researchers as it addresses the pressing challenge of ensuring safe AI deployment in real-world scenarios. The methods proposed for balancing safety and utility have direct implications for enhancing the trustworthiness of AI systems, a core concern in AI safety and security research. Additionally, the insights into the dynamics of rule application versus reasoning processes can contribute to developing more robust frameworks for AI alignment.
