Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Authors: Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang

Published: 2026-02-06

arXiv ID: 2602.06650v1

Added to Library: 2026-02-09 03:04 UTC

Safety

📄 Abstract

Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify→Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
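The hierarchical policy resolution described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the category names, the `resolve_action` helper, and the default-to-comply behavior are all assumptions made here for clarity; only the three actions (comply, guide, reject) and the rule that the global policy is non-overridable come from the abstract.

```python
from enum import Enum

class Action(Enum):
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"

# Global safety policy: immutable label-to-action boundaries
# (category names here are illustrative, drawn from the abstract's examples).
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}

def resolve_action(risk_label: str, user_policy: dict) -> Action:
    """Hierarchical resolution: the global policy always wins; the
    user-defined policy covers only non-global categories."""
    if risk_label in GLOBAL_POLICY:  # non-overridable boundary
        return GLOBAL_POLICY[risk_label]
    # Assumed default: unlisted (benign) labels get full compliance.
    return user_policy.get(risk_label, Action.COMPLY)

# A user-defined policy may add domain-specific categories...
user_policy = {"medical_advice": Action.GUIDE}
# ...but attempting to override a global boundary has no effect:
user_policy["child_safety"] = Action.COMPLY  # ignored by resolve_action

assert resolve_action("child_safety", user_policy) is Action.REJECT
assert resolve_action("medical_advice", user_policy) is Action.GUIDE
assert resolve_action("cooking_tips", user_policy) is Action.COMPLY
```

In PACT itself this routing is performed by the model through an explicit Classify→Act chain-of-thought rather than a lookup table; the sketch only shows the precedence semantics of the two policy layers.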

🔍 Key Points

  • Introduction of PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control in LLMs that mitigates the safety-helpfulness trade-off.
  • Development of a hierarchical policy architecture consisting of a global safety policy that enforces strict boundaries and user-defined policies that allow for specific, flexible control.
  • Implementation of a chain-of-thought path mechanism that decomposes safety decisions into structured Classify→Act reasoning paths, enhancing interpretability and controllability at runtime.
  • Extensive experimental validation demonstrating PACT's superior performance on safety and helpfulness metrics compared to state-of-the-art models, establishing a new baseline for controllability.
  • Release of the PACT model suite and evaluation protocols for reproducible research, contributing to the broader AI safety community.

💡 Why This Paper Matters

This paper is significant as it addresses a critical challenge in AI safety and helpfulness by proposing a novel framework (PACT) that combines robust global safety measures with flexible, user-defined policies. It offers an innovative approach to enhance the response controllability of large language models (LLMs) in safety-critical applications, potentially improving user trust and utility.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it presents foundational work on a dynamic safety control mechanism that could set new standards for LLM deployment in sensitive domains. By exploring hierarchical policies and risk-aware reasoning, researchers can gain insights into developing safer AI systems that can adapt to diverse operational contexts while minimizing threats.

📚 Read the Full Paper