Jailbreak Strength and Model Similarity Predict Transferability

Authors: Rico Angell, Jannik Brinkmann, He He

Published: 2025-06-15

arXiv ID: 2506.12913v1

Added to Library: 2025-06-17 03:01 UTC

Red Teaming

📄 Abstract

Jailbreaks pose an imminent threat to ensuring the safety of modern AI systems by enabling users to disable safeguards and elicit unsafe information. Sometimes, jailbreaks discovered for one model incidentally transfer to another model, exposing a fundamental flaw in safeguarding. Unfortunately, there is no principled approach to identify when jailbreaks will transfer from a source model to a target model. In this work, we observe that transfer success from a source model to a target model depends on quantifiable measures of both jailbreak strength with respect to the source model and the contextual representation similarity of the two models. Furthermore, we show transferability can be increased by distilling from the target model into the source model where the only target model responses used to train the source model are those to benign prompts. We show that the distilled source model can act as a surrogate for the target model, yielding more transferable attacks against the target model. These results suggest that the success of jailbreaks is not merely due to exploitation of safety training failing to generalize out-of-distribution, but instead a consequence of a more fundamental flaw in contextual representations computed by models.

🔍 Key Points

  • The paper establishes that the success of jailbreak transfer from a source model to a target model correlates positively with the jailbreak's strength on the source model and with the contextual representation similarity of the two models.
  • It quantifies jailbreak strength with respect to the source model and measures representational similarity between models using a mutual k-nearest neighbors metric over contextual representations (a minimal sketch follows this list), making transferability predictable rather than incidental.
  • The authors show that distilling the target model into the source model, using only the target's responses to benign prompts, yields a surrogate source model whose attacks transfer to the target more reliably, without requiring any unsafe target outputs during training.
  • It provides a systematic evaluation involving multiple models from different families, yielding insights into the security vulnerabilities of instruction-tuned language models.
  • The findings suggest that jailbreak success reflects a fundamental flaw in the contextual representations models compute, rather than merely safety training failing to generalize out-of-distribution, implying that current safety mechanisms are insufficient and that stronger defenses are needed.
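
The paper's representational-similarity measure is described as a mutual k-nearest neighbors metric. Below is a minimal sketch of such a metric, assuming cosine-similarity neighborhoods over per-prompt hidden-state vectors; the function names, distance choice, and default k are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def knn_indices(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbors under cosine similarity (self excluded)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    return np.argsort(-sims, axis=1)[:, :k]  # top-k most similar rows

def mutual_knn_similarity(reps_a: np.ndarray, reps_b: np.ndarray, k: int = 10) -> float:
    """Average overlap of k-NN sets computed in each model's representation space.

    reps_a, reps_b: (n_prompts, d) arrays of contextual representations of the
    same prompts from two different models (d may differ between models).
    Returns a value in [0, 1]; higher means more similar neighborhood structure.
    """
    nn_a = knn_indices(reps_a, k)
    nn_b = knn_indices(reps_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Hypothetical usage: reps_src / reps_tgt would be hidden-state vectors
# (e.g. last-token activations) extracted from each model on a shared prompt set.
# score = mutual_knn_similarity(reps_src, reps_tgt, k=10)
```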

💡 Why This Paper Matters

This paper significantly advances the understanding of when jailbreaks transfer between language models by tying transfer success to quantifiable measures of jailbreak strength and model similarity. By showing that distillation on benign responses alone produces a surrogate source model that yields more transferable attacks against the target, it exposes a deeper weakness in current safeguards and offers concrete guidance for designing more robust AI systems, making it a relevant contribution to the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

This paper should be of keen interest to AI security researchers because it rigorously explores the vulnerabilities of state-of-the-art language models through empirical analysis across multiple model families. Understanding the mechanisms behind jailbreak transferability is crucial for developing more effective security measures and can inform the design of AI systems that are resilient to adversarial threats.

📚 Read the Full Paper