
Linearly Decoding Refused Knowledge in Aligned Language Models

Authors: Aryan Shrivastava, Ari Holtzman

Published: 2025-06-30

arXiv ID: 2507.00239v1

Added to Library: 2025-07-03 04:01 UTC

Red Teaming

📄 Abstract

Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse user requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding $0.8$. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks decode generatively, suggesting that the internal representations of many refused properties persist from base LMs through instruction-tuning. Importantly, we show that this information is not merely "leftover" in instruction-tuned models, but is actively used by them: we find that probe-predicted values correlate with LM-generated pairwise comparisons, indicating that the information decoded by our probes aligns with suppressed generative behavior that may be expressed more subtly in other downstream tasks. Overall, our results suggest that instruction-tuning does not wholly eliminate or even relocate harmful information in representation space; it merely suppresses that information's direct expression, leaving it both linearly accessible and indirectly influential in downstream behavior.
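
The core setup described in the abstract is simple: collect the scalar answers a jailbroken model gives for some refused property, extract hidden states for the corresponding prompts, and fit a linear probe mapping one to the other. Below is a minimal sketch of that idea, not the authors' exact pipeline; the model name, layer choice, prompt template, and target values are placeholders.

```python
import numpy as np
import torch
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper probes larger aligned LMs
LAYER = 6            # illustrative choice of a middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_hidden(prompt: str) -> np.ndarray:
    """Hidden state of the final token at LAYER for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states is a tuple of (num_layers + 1) tensors [batch, seq, dim]
    return out.hidden_states[LAYER][0, -1].numpy()

# Hypothetical data: entities with a refused property, paired with the
# scalar answers a jailbroken model produced for them (placeholder values).
entities = ["Country A", "Country B", "Country C", "Country D", "Country E"]
jailbroken_answers = np.array([98.0, 85.0, 102.0, 91.0, 88.0])

features = np.stack(
    [last_token_hidden(f"The average IQ of {e} is") for e in entities]
)

# Fit the linear probe and check how well its predictions track the
# jailbroken generations (the paper evaluates on held-out prompts;
# a proper train/test split is omitted here for brevity).
probe = Ridge(alpha=1.0).fit(features, jailbroken_answers)
r, _ = pearsonr(probe.predict(features), jailbroken_answers)
print(f"Pearson r between probe predictions and jailbroken answers: {r:.2f}")
```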

🔍 Key Points

  • The study shows that a significant amount of information initially refused by instruction-tuned language models remains linearly decodable from their hidden states: linear probes trained on those states correlate strongly with jailbroken responses.
  • Probes trained on base models often transfer to their instruction-tuned counterparts and predict their jailbroken responses, indicating that representations of harmful information persist through instruction-tuning (see the sketch after this list).
  • Instruction-tuning does not completely remove or relocate harmful knowledge in representation space; it suppresses its expression, leaving that knowledge linearly accessible and able to surface subtly in downstream behavior.
  • These findings raise concerns about how effectively current alignment techniques prevent LMs from exposing harmful information, pointing to the limits of refusal-based safety mechanisms.
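
To illustrate the probe-transfer result in the second point above, here is a toy sketch under simplified assumptions: synthetic feature matrices stand in for real hidden states, the instruction-tuned model's activations are a lightly perturbed copy of the base model's, and the targets are a noisy linear function of the base activations. A probe fit only on the "base" activations is then evaluated on the "instruct" activations.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_items, dim, n_train = 200, 512, 150

# Synthetic stand-ins for last-token hidden states of the same prompts,
# extracted from a base model and from its instruction-tuned version.
# Toy assumption: the two models represent the property similarly.
base_hidden = rng.normal(size=(n_items, dim))
instruct_hidden = base_hidden + 0.1 * rng.normal(size=(n_items, dim))

# Targets: scalar answers the (non-refusing) base model would give,
# constructed here as a noisy linear function of its activations.
direction = rng.normal(size=dim)
y = base_hidden @ direction + 0.5 * rng.normal(size=n_items)

# Train the probe on base-model activations only...
probe = Ridge(alpha=10.0).fit(base_hidden[:n_train], y[:n_train])

# ...then test whether it still decodes the property from the aligned
# model's activations, mirroring the transfer experiment described above.
r, _ = pearsonr(probe.predict(instruct_hidden[n_train:]), y[n_train:])
print(f"Transfer Pearson r (base-trained probe on instruct activations): {r:.2f}")
```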

💡 Why This Paper Matters

This paper matters because it systematically investigates how harmful information persists in aligned language models, demonstrating that the techniques meant to restrict this information do not actually eliminate it. The results have implications both for building safer AI systems and for developing more robust alignment methods that meet ethical standards.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers should find this paper particularly relevant because it exposes a vulnerability in the alignment frameworks of language models: harmful information remains accessible in ostensibly safe models, both through jailbreaks and through simple linear probes. Understanding this persistence is crucial for designing safety measures that go beyond suppressing surface behavior and for ensuring alignment that holds up in deployment.

📚 Read the Full Paper