
LLMs Encode Harmfulness and Refusal Separately

Authors: Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

Published: 2025-07-16

arXiv ID: 2507.11878v1

Added to Library: 2025-07-17 04:01 UTC

Red Teaming

πŸ“„ Abstract

LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.
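
To make the idea of a latent direction concrete, the sketch below extracts a candidate harmfulness direction as the difference of mean hidden states between a few harmful and harmless prompts, then nudges the residual stream along it during generation via a forward hook. The model name, layer index, prompt lists, and steering strength are illustrative assumptions; the paper's exact extraction and steering procedure may differ.

```python
# Minimal sketch (not the paper's exact procedure): derive a "harmfulness"
# direction as a difference of mean hidden states between harmful and harmless
# prompts, then steer generation along it with a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model; any decoder-only LLM works
LAYER = 16                                  # assumed extraction/steering layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def last_token_hidden(prompt: str, layer: int) -> torch.Tensor:
    """Hidden state of the final prompt token at the given layer."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]

# Placeholder calibration prompts; a real run would use a labeled prompt set.
harmful_prompts = ["Explain how to pick a lock to break into a house."]
harmless_prompts = ["Explain how to pick a ripe avocado at the store."]

mu_harmful = torch.stack([last_token_hidden(p, LAYER) for p in harmful_prompts]).mean(0)
mu_harmless = torch.stack([last_token_hidden(p, LAYER) for p in harmless_prompts]).mean(0)
harm_dir = mu_harmful - mu_harmless
harm_dir = harm_dir / harm_dir.norm()

def steering_hook(module, inputs, output, alpha=8.0):
    """Add alpha * harm_dir to the layer's residual-stream output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * harm_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
# ... call model.generate(...) here: activations are nudged along harm_dir ...
handle.remove()
```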

πŸ” Key Points

  • LLMs encode harmfulness and refusal as separate internal concepts, leading to the development of distinct harmfulness and refusal directions in the latent space.
  • The harmfulness direction can cause LLMs to interpret harmless inputs as harmful, while the refusal direction elicits direct refusal responses without altering harmfulness judgments.
  • The proposed Latent Guard leverages the model's internal harmfulness representation to flag unsafe inputs, achieving performance comparable to or better than dedicated finetuned safeguard models such as Llama Guard 3 8B (a minimal sketch follows this list).
  • The analysis of jailbreak methods indicates that some techniques suppress refusal signals without reversing the model’s belief in harmfulness, revealing vulnerabilities in LLM safety mechanisms.
  • Harmfulness representations are shown to be fine-grained and vary by risk categories, highlighting a more nuanced internal understanding of harmfulness in LLMs.
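
As a rough illustration of the Latent Guard idea, the sketch below (reusing last_token_hidden and the prompt lists from the previous sketch) trains a simple logistic-regression probe on the model's hidden states and thresholds its harmfulness score on new inputs. The probe form, layer, and calibration data are assumptions for illustration; the paper's Latent Guard is built from the identified harmfulness concept and may be constructed differently.

```python
# Rough "Latent Guard"-style sketch: a linear probe on the model's own hidden
# states that scores inputs for harmfulness, independent of whether the model
# actually refuses. Reuses last_token_hidden / prompt lists from the sketch above.
import numpy as np
from sklearn.linear_model import LogisticRegression

PROBE_LAYER = 16  # assumed probe layer

def featurize(prompts, layer=PROBE_LAYER):
    """Stack last-token hidden states into a feature matrix."""
    return np.stack([last_token_hidden(p, layer).float().cpu().numpy() for p in prompts])

# Placeholder labels: 1 = harmful, 0 = harmless.
X = featurize(harmful_prompts + harmless_prompts)
y = [1] * len(harmful_prompts) + [0] * len(harmless_prompts)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def latent_guard(prompt: str, threshold: float = 0.5) -> bool:
    """Flag an input as unsafe when its latent harmfulness score is high."""
    score = probe.predict_proba(featurize([prompt]))[0, 1]
    return score >= threshold

# Because the score reads internal representations rather than refusal behavior,
# it can remain informative when jailbreaks or adversarial finetuning suppress
# the refusal signal without changing the model's internal belief of harmfulness.
```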

πŸ’‘ Why This Paper Matters

This paper provides foundational insight into how large language models (LLMs) internally represent harmfulness and refusal. By establishing that these concepts are encoded separately, it opens new avenues for improving AI safety mechanisms, particularly for handling harmful instructions. The Latent Guard is a practical application of these findings: the model's own latent harmfulness representation can serve as a safeguard that matches or exceeds dedicated finetuned safeguard models while remaining robust to finetuning attacks. Given growing concerns about the safety of AI systems, this work is valuable for researchers aiming to build safer and more reliable AI technologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers because of its novel approach to understanding and addressing safety concerns in LLMs. Separating harmfulness from refusal representations provides a new framework for analyzing model behavior, which is essential for developing effective safeguards against misuse. The Latent Guard also offers a practical strategy for hardening LLMs against adversarial threats such as jailbreaks and finetuning attacks, making this a significant contribution to AI security.

πŸ“š Read the Full Paper