
SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory

Authors: Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem

Published: 2025-08-15

arXiv ID: 2508.11290v1

Added to Library: 2025-08-18 04:00 UTC

Safety

📄 Abstract

LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or that frequently use LLMs for specific tasks (e.g., sentiment analysis, language translation). Through comprehensive evaluation, we demonstrate that LLMs still tend to refuse harmful instructions even when those instructions are reframed to appear as benign tasks. Our mechanistic analysis reveals that LLMs follow distinct "constellation" patterns in embedding space as representations traverse layers, with each task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. By selectively guiding model behavior only on tasks prone to over-refusal, and by preserving general model behavior elsewhere, our method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled approach to mitigating over-refusals.
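As a rough illustration of the inference-time steering described above (not the authors' released implementation), the sketch below shifts hidden states at a single decoder layer toward a non-refusal direction using a forward hook on a Llama-style Hugging Face model. The model name, intervention layer, steering strength, and the `delta` vector are placeholder assumptions; in the paper, the direction and layer are derived from task-specific trajectory statistics.

```python
# Hedged sketch of inference-time activation steering; not the paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 14   # assumed intervention layer (the paper selects this dynamically)
alpha = 4.0      # assumed steering strength
# Placeholder direction; in practice this would be the task-specific
# non-refusal-minus-refusal trajectory difference.
delta = torch.zeros(model.config.hidden_size, dtype=model.dtype)

def steer(module, inputs, output):
    # Decoder layers typically return a tuple with hidden states first;
    # nudge those hidden states toward the non-refusal trajectory.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + alpha * delta.to(hs.device, hs.dtype)
    if isinstance(output, tuple):
        return (hs,) + tuple(output[1:])
    return hs

handle = model.model.layers[layer_idx].register_forward_hook(steer)
try:
    prompt = "Translate to French: <text that superficially resembles harmful content>"
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore default behavior for tasks not prone to over-refusal
```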

🔍 Key Points

  • Development of SafeConstellations, a novel inference-time trajectory-shifting approach to reduce over-refusal rates in LLMs by up to 73%.
  • Establishment of a benchmark dataset to measure task-specific over-refusal rates in NLP applications, highlighting the variability of safety needs across different tasks.
  • Mechanistic analysis revealing that LLMs follow distinct 'constellation' patterns in embedding space that correspond to task identity, allowing for precise steering interventions to mitigate over-refusals.
  • Proposed dynamic layer selection and task-specific steering techniques enable targeted intervention without significant impact on general model behavior or utility (a hedged sketch follows this list).
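
The sketch below illustrates, under stated assumptions, how the last two points might fit together: a task-specific steering direction computed as the mean difference between non-refused and over-refused hidden states, and an intervention layer chosen dynamically where the two trajectories separate most. The tensor shapes, the norm-based selection criterion, and the `task_steering_vectors` helper are illustrative, not the paper's exact procedure.

```python
# Hedged sketch: task-specific steering direction + dynamic layer selection.
import torch

def task_steering_vectors(acts_nonrefusal: torch.Tensor,
                          acts_refusal: torch.Tensor):
    """
    acts_*: [num_examples, num_layers, hidden_size] hidden states cached
            for one task (e.g., translation prompts over benign vs.
            harmful-seeming content).
    Returns per-layer steering directions and the selected layer index.
    """
    mean_nr = acts_nonrefusal.mean(dim=0)   # [num_layers, hidden]
    mean_r = acts_refusal.mean(dim=0)       # [num_layers, hidden]
    directions = mean_nr - mean_r           # pull representations toward non-refusal

    # Dynamic layer selection (assumed criterion): intervene where the
    # refusal and non-refusal trajectories are farthest apart.
    separation = directions.norm(dim=-1)    # [num_layers]
    layer_idx = int(separation.argmax())
    return directions, layer_idx

# Toy usage with random activations standing in for cached trajectories.
nonref = torch.randn(32, 28, 4096)
ref = torch.randn(32, 28, 4096) + 0.5
dirs, layer = task_steering_vectors(nonref, ref)
print(f"steer at layer {layer} with a vector of norm {dirs[layer].norm():.2f}")
```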

💡 Why This Paper Matters

This paper presents a significant advance in addressing over-refusal behavior in Large Language Models by introducing SafeConstellations. By leveraging task-specific trajectory patterns, the approach increases the usability of LLMs in sensitive contexts while keeping the models cautious toward genuinely harmful instructions. This methodological innovation is likely to improve the deployment of LLMs in practical applications where both utility and safety are critical.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers as it addresses the common challenge of over-refusal in LLMs, a phenomenon that can hinder model efficiency and responsiveness in real-world applications. Understanding and mitigating these dynamics directly relates to ensuring the safety and robustness of AI systems, making this research particularly relevant for developing secure AI technologies that align model behavior with human intentions while minimizing risks.

📚 Read the Full Paper