Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Authors: Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang

Published: 2026-02-06

arXiv ID: 2602.06650v1

Added to Library: 2026-02-09 03:04 UTC

Safety

📄 Abstract

Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify→Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
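The hierarchical policy resolution described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the category names, the `resolve_action` helper, and the default-to-comply behavior are all assumptions made here for clarity; only the three actions (comply, guide, reject) and the rule that the global policy is non-overridable come from the abstract.

```python
from enum import Enum

class Action(Enum):
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"

# Global safety policy: immutable label-to-action boundaries
# (category names here are illustrative, drawn from the abstract's examples).
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}

def resolve_action(risk_label: str, user_policy: dict) -> Action:
    """Hierarchical resolution: the global policy always wins; the
    user-defined policy covers only non-global categories."""
    if risk_label in GLOBAL_POLICY:  # non-overridable boundary
        return GLOBAL_POLICY[risk_label]
    # Assumed default: unlisted (benign) labels get full compliance.
    return user_policy.get(risk_label, Action.COMPLY)

# A user-defined policy may add domain-specific categories...
user_policy = {"medical_advice": Action.GUIDE}
# ...but attempting to override a global boundary has no effect:
user_policy["child_safety"] = Action.COMPLY  # ignored by resolve_action

assert resolve_action("child_safety", user_policy) is Action.REJECT
assert resolve_action("medical_advice", user_policy) is Action.GUIDE
assert resolve_action("cooking_tips", user_policy) is Action.COMPLY
```

In PACT itself this routing is performed by the model through an explicit Classify→Act chain-of-thought rather than a lookup table; the sketch only shows the precedence semantics of the two policy layers.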

🔍 Key Points

  • Introduction of PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control in LLMs that mitigates the safety-helpfulness trade-off.
  • Development of a hierarchical policy architecture consisting of a global safety policy that enforces strict boundaries and user-defined policies that allow for specific, flexible control.
  • Implementation of a chain-of-thought path mechanism that decomposes safety decisions into structured Classify→Act reasoning paths, enhancing interpretability and controllability at runtime.
  • Extensive experimental validation demonstrating PACT's superior performance on safety and helpfulness metrics compared to state-of-the-art models, establishing a new baseline for controllability.
  • Release of the PACT model suite and evaluation protocols for reproducible research, contributing to the broader AI safety community.

💡 Why This Paper Matters

This paper is significant as it addresses a critical challenge in AI safety and helpfulness by proposing a novel framework (PACT) that combines robust global safety measures with flexible, user-defined policies. It offers an innovative approach to enhance the response controllability of large language models (LLMs) in safety-critical applications, potentially improving user trust and utility.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it presents foundational work on a dynamic safety control mechanism that could set new standards for LLM deployment in sensitive domains. By exploring hierarchical policies and risk-aware reasoning, researchers can gain insights into developing safer AI systems that can adapt to diverse operational contexts while minimizing threats.

📚 Read the Full Paper