
Should LLM Safety Be More Than Refusing Harmful Instructions?

Authors: Utsav Maskey, Mark Dras, Usman Naseem

Published: 2025-06-03

arXiv ID: 2506.02442v1

Added to Library: 2025-06-04 04:01 UTC

Safety

📄 Abstract

This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and the resulting safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal, the ability to reject harmful obfuscated instructions, and (2) generation safety, the suppression of harmful response generation. Through comprehensive experiments, we demonstrate that models capable of decrypting ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding LLM safety in long-tail text scenarios and provides directions for developing robust safety mechanisms.
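
The abstract's two safety dimensions can be read as a per-response judgment. Below is a minimal sketch of how such an assessment might be operationalized, assuming ROT13 as a stand-in for the long-tail encrypted texts, a keyword heuristic for detecting refusals, and a caller-supplied harmfulness judge; all names and heuristics are illustrative, not the paper's actual evaluation pipeline.

```python
# Illustrative sketch only: ROT13, the refusal markers, and the function
# names are assumptions, not the paper's implementation.
import codecs
from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyOutcome:
    refused: bool         # dimension 1: instruction refusal
    harmful_output: bool  # dimension 2: generation safety violated

# Hypothetical refusal markers; a real evaluation would use a stronger judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def encode_long_tail(instruction: str) -> str:
    """Push an instruction into a long-tail distribution via a simple cipher."""
    return codecs.encode(instruction, "rot_13")

def judge_response(response: str,
                   is_harmful: Callable[[str], bool]) -> SafetyOutcome:
    """Score a single model response on both safety dimensions."""
    lowered = response.lower()
    return SafetyOutcome(
        refused=any(marker in lowered for marker in REFUSAL_MARKERS),
        harmful_output=is_harmful(response),
    )

def classify(outcome: SafetyOutcome, prompt_was_harmful: bool) -> str:
    """Collapse the two dimensions into the failure modes the paper highlights."""
    if prompt_was_harmful and not outcome.refused and outcome.harmful_output:
        return "unsafe response (generation-safety failure)"
    if not prompt_was_harmful and outcome.refused:
        return "over-refusal (instruction-refusal failure)"
    return "safe on both dimensions"
```

A harmful ciphered prompt that is neither refused nor caught by the harmfulness judge lands in the first branch; a benign ciphered prompt that triggers a blanket refusal lands in the second, which is the over-refusal case the paper measures.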

🔍 Key Points

  • Introduces a two-dimensional framework for assessing the safety of LLMs that includes instruction refusal and generation safety, highlighting critical safety failure scenarios in handling long-tail encrypted texts.
  • Findings reveal that LLMs are vulnerable to mismatched-generalization attacks, producing unsafe responses or excessive refusals when exposed to harmful encrypted instructions.
  • Empirical evaluation of multiple safety mechanisms shows that current pre-LLM and post-LLM defenses perform inadequately, underscoring how poorly existing approaches distinguish harmful from benign encrypted content.
  • The paper examines the strengths and weaknesses of several defense mechanisms (e.g., perplexity filters, self-examination, LLaMA Guard), stressing the need for safeguards that genuinely understand encrypted messages; a minimal sketch of one such pre-LLM filter follows this list.
  • Proposes future directions for enhancing LLM safety by integrating comprehension of encrypted content into both pre-LLM and post-LLM safety mechanisms.
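
Since perplexity filters are named among the evaluated pre-LLM defenses, here is a minimal sketch of how such a filter is commonly built, assuming a small reference LM (GPT-2 via Hugging Face transformers) and an illustrative threshold; the exact models, thresholds, and scoring used in the paper are not given in this summary.

```python
# Sketch of a perplexity-based pre-LLM input filter. The reference model and
# threshold below are assumptions for illustration, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of the input text under the reference LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return torch.exp(loss).item()

def passes_filter(prompt: str, threshold: float = 200.0) -> bool:
    """Accept prompts whose perplexity stays below a (hypothetical) threshold."""
    return perplexity(prompt) < threshold
```

Encrypted or ciphered text typically scores far above natural language under such a reference model, which illustrates the paper's point: a pure perplexity gate tends to block benign ciphertext just as readily as harmful ciphertext.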

💡 Why This Paper Matters

This paper is significant because it addresses critical gaps in the safety protocols of large language models, specifically in how they handle encrypted texts. By establishing a structured evaluation framework and empirically evaluating LLM behavior under adversarial conditions, it highlights both vulnerabilities and paths toward more robust safety mechanisms. As LLMs become increasingly integrated into applications requiring secure and safe interactions, understanding their weaknesses and strengthening their safety is essential for responsible AI deployment.

🎯 Why It's Interesting for AI Security Researchers

This paper will interest AI security researchers because it examines the underexplored vulnerabilities that arise when LLMs can decrypt ciphered inputs, a capability that enables adversarial attacks. It presents empirical evaluations that reveal the limitations of existing safety measures, making it valuable for researchers developing new safeguards against misuse. Furthermore, the proposed framework and findings can guide future research toward AI models that handle long-tail input scenarios gracefully without compromising safety.

📚 Read the Full Paper