
Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Authors: Zhiyu Xue, Zimo Qi, Guangliang Liu, Bocheng Chen, Ramtin Pedarsani

Published: 2026-03-12

arXiv ID: 2603.11388v1

Added to Library: 2026-03-13 03:02 UTC

📄 Abstract

Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety-alignment post-training, remains insufficiently studied. This issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses: safety alignment encourages LLMs to associate the refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, refusal triggers include not only harmful linguistic cues but also non-harmful ones, thereby causing overrefusal of benign queries. Building on this mechanistic analysis, we propose a method that explicitly accounts for refusal triggers during safety-alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
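
The paper's own procedure for locating refusal triggers is not reproduced here, but the idea admits a simple illustration: score each word of a query by how much removing it lowers the model's probability of refusing. The sketch below is a minimal leave-one-out version of that idea; the model name, the chat-template usage, and the "I cannot" refusal proxy are all assumptions made for illustration, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any safety-aligned chat model works here.
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# Crude proxy for "the model refuses": its answer opens with this prefix.
REFUSAL_PREFIX = "I cannot"
PREFIX_IDS = tokenizer(REFUSAL_PREFIX, add_special_tokens=False).input_ids

@torch.no_grad()
def refusal_logprob(query: str) -> float:
    """Log-probability that the model starts its reply with REFUSAL_PREFIX."""
    ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": query}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    total = 0.0
    for tok in PREFIX_IDS:  # score the prefix one token at a time
        logits = model(ids).logits[0, -1]
        total += torch.log_softmax(logits, dim=-1)[tok].item()
        ids = torch.cat([ids, torch.tensor([[tok]], device=ids.device)], dim=1)
    return total

def trigger_scores(query: str) -> list[tuple[str, float]]:
    """Leave-one-out attribution: how much does dropping each word reduce
    the refusal probability? High-scoring words are candidate triggers."""
    base = refusal_logprob(query)
    words = query.split()
    scores = [
        (w, base - refusal_logprob(" ".join(words[:i] + words[i + 1:])))
        for i, w in enumerate(words)
    ]
    return sorted(scores, key=lambda s: -s[1])

# On a benign query, high-scoring words are non-harmful refusal triggers.
print(trigger_scores("How do I kill a Python process on Linux?"))
```

On a benign query like the one above, a word such as "kill" scoring high would be exactly the non-harmful refusal trigger the paper describes.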

🔍 Key Points

  • The paper studies overrefusal, the problem of safety-aligned LLMs rejecting benign queries after post-training on harmful queries paired with refusal answers, and argues it remains insufficiently studied despite the wide industrial adoption of safety alignment.
  • It defines refusal triggers as linguistic cues in the training data that elicit refusal responses: safety alignment teaches LLMs to associate the triggers present in a training sample with refusing.
  • Mechanistically, refusal triggers include non-harmful linguistic cues alongside harmful ones, and it is these non-harmful cues that lead aligned models to refuse benign queries.
  • Building on this analysis, the authors propose a fine-tuning method that explicitly accounts for refusal triggers during safety alignment (one hedged instantiation is sketched after this list).
  • Empirically, the method achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries than prior approaches.
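
The abstract does not spell out the training objective, so the following is only a sketch of what "explicitly accounting for refusal triggers" in fine-tuning might look like: a weighted SFT cross-entropy that upweights benign queries containing known trigger cues, paired with helpful (non-refusal) target responses. The batch fields `is_benign` and `has_trigger` and the `alpha` weighting are hypothetical, introduced purely for illustration.

```python
import torch
import torch.nn.functional as F

def trigger_aware_loss(model, batch, alpha: float = 0.5) -> torch.Tensor:
    """Weighted SFT cross-entropy. Expected (hypothetical) batch fields:
      input_ids:   prompt + response tokens, shape (B, T)
      labels:      same shape, prompt positions masked with -100
      is_benign:   (B,) 1.0 if the query is benign, else 0.0
      has_trigger: (B,) 1.0 if the query contains a refusal-trigger cue
    """
    logits = model(batch["input_ids"]).logits          # (B, T, V)
    labels = batch["labels"]
    per_tok = F.cross_entropy(                         # (B, T-1)
        logits[:, :-1].transpose(1, 2), labels[:, 1:],
        ignore_index=-100, reduction="none",
    )
    mask = (labels[:, 1:] != -100).float()
    per_ex = (per_tok * mask).sum(1) / mask.sum(1).clamp(min=1)  # (B,)

    # Upweight benign queries that contain a refusal trigger, so the model
    # learns that the trigger cue alone does not warrant refusal. (These
    # examples carry helpful, non-refusal target responses.)
    w = 1.0 + alpha * batch["is_benign"] * batch["has_trigger"]
    return (w * per_ex).mean()
```

Dropped into a standard training loop in place of the usual cross-entropy, this pushes the model to decouple the trigger cue itself from the refusal decision, which is the failure mode the paper's analysis identifies. The paper's actual objective may differ.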

💡 Why This Paper Matters

This paper addresses a practical gap in safety alignment: post-training that makes a model refuse harmful requests also makes it refuse benign ones, degrading its usability in deployment. By tracing overrefusal to refusal triggers learned during alignment, including non-harmful linguistic cues, and by proposing a trigger-aware fine-tuning method, it offers both a mechanistic explanation and a concrete mitigation, with empirical results showing a better defense-responsiveness trade-off than prior methods.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because jailbreak robustness and overrefusal are two sides of the same alignment trade-off: defenses that harden a model against adversarial prompts tend to make it reject legitimate requests. The refusal-trigger framing provides a testable mechanistic account of why that happens, and the proposed fine-tuning strategy offers a baseline for evaluating future safety-alignment methods on both jailbreak defense and benign responsiveness.
