
Should LLM Safety Be More Than Refusing Harmful Instructions?

Authors: Utsav Maskey, Mark Dras, Usman Naseem

Published: 2025-06-03

arXiv ID: 2506.02442v2

Added to Library: 2025-06-05 01:00 UTC

Safety

📄 Abstract

This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and the safety implications of that behavior. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal, the ability to reject harmful obfuscated instructions, and (2) generation safety, the suppression of harmful content in generated responses. Through comprehensive experiments, we demonstrate that models capable of decrypting ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding LLM safety in long-tail text scenarios and provides directions for developing robust safety mechanisms.
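
To make the two-dimensional framework concrete, here is a minimal Python sketch (not the paper's evaluation harness) of how the two dimensions combine into the outcomes the abstract mentions. `SafetyOutcome`, `assess`, and the harmfulness labels are hypothetical placeholders; in practice they would come from an LLM judge or a trained classifier.

```python
# Minimal sketch of the two-dimensional safety assessment:
# dimension 1 = instruction refusal, dimension 2 = generation safety.
from dataclasses import dataclass


@dataclass
class SafetyOutcome:
    refused: bool          # dimension 1: did the model refuse the instruction?
    output_harmful: bool   # dimension 2: did the generation contain harmful content?


def assess(prompt_is_harmful: bool, outcome: SafetyOutcome) -> str:
    """Map the two dimensions onto the outcomes discussed in the paper."""
    if prompt_is_harmful:
        if outcome.refused:
            return "safe: harmful obfuscated instruction correctly refused"
        if outcome.output_harmful:
            return "unsafe: mismatched generalization bypassed both dimensions"
        return "complied but suppressed harmful content"
    # Benign (e.g. benign encrypted) prompt
    if outcome.refused:
        return "over-refusal: benign instruction rejected, utility lost"
    return "safe and useful: benign instruction served"


# Example: an encrypted harmful instruction the model decrypts and answers.
print(assess(True, SafetyOutcome(refused=False, output_harmful=True)))
```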

🔍 Key Points

  • Introduction of a two-dimensional framework for assessing LLM safety: instruction refusal and generation safety.
  • Demonstration of LLMs' vulnerability to mismatched-generalization attacks, particularly when handling encrypted texts.
  • Evaluation of various safety mechanisms, exposing their strengths and shortcomings in preventing harmful outputs while preserving utility.
  • Experimental findings indicating that pre-LLM methods struggle to reliably distinguish benign from harmful encrypted inputs.
  • Post-LLM mechanisms succeed in suppressing harmful responses but often over-refuse benign instructions (a minimal guard-pipeline sketch follows this list).
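
The sketch below illustrates, under assumptions, the pre-LLM / post-LLM safeguard pattern these points refer to; it is not the paper's specific implementation. `input_filter`, `output_filter`, and `generate` are hypothetical stand-ins: the two filters would typically be moderation classifiers, and `generate` is the underlying model call.

```python
# Sketch of a guard pipeline combining a pre-LLM input filter with a
# post-LLM output filter around a model call.
from typing import Callable

REFUSAL = "I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_filter: Callable[[str], bool],   # pre-LLM: True if the input looks harmful
    output_filter: Callable[[str], bool],  # post-LLM: True if the output is harmful
) -> str:
    # Pre-LLM safeguard: per the paper's findings, encrypted harmful prompts
    # often evade this check, while unusual-but-benign ciphertext may be
    # falsely flagged.
    if input_filter(prompt):
        return REFUSAL

    response = generate(prompt)

    # Post-LLM safeguard: catches harmful generations more reliably, but
    # aggressive thresholds lead to over-refusal of benign requests.
    if output_filter(response):
        return REFUSAL
    return response
```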

💡 Why This Paper Matters

This paper addresses critical safety gaps in large language models by introducing a robust framework for assessing their behavior on long-tail encrypted texts. It highlights the need to improve AI safety mechanisms so that harmful outputs are prevented without sacrificing model utility, contributing to the ongoing discourse on AI security.

🎯 Why It's Interesting for AI Security Researchers

The insights and evaluations in this paper are directly relevant to AI security researchers, as they outline both the theoretical and practical challenges of deploying large language models safely. By probing the cryptanalytic capabilities of LLMs and the vulnerabilities they expose, the paper serves as a key resource for developing stronger safety measures and for understanding potential attack vectors in AI systems.

📚 Read the Full Paper: https://arxiv.org/abs/2506.02442v2