
AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Authors: Debdeep Sanyal, Manodeep Ray, Murari Mandal

Published: 2025-09-06

arXiv ID: 2509.08000v1

Added to Library: 2025-09-11 04:01 UTC

Red Teaming

📄 Abstract

The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model's weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model's internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. AntiDote is up to 27.4% more robust against adversarial attacks than both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.
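To make the bi-level procedure concrete, the toy sketch below mirrors the structure the abstract describes: an adversary hypernetwork maps the defender's activations to a low-rank (LoRA-style) weight delta that tries to break safety, and the defender is then updated so that it stays aligned even with that delta applied. Everything here (the single linear "defender", the dimensions, the MSE safety proxy, the norm constraint, and helper names such as lora_delta and safety_loss) is an illustrative assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, rank = 32, 4

# Stand-in for one weight matrix of the defender LLM (bias-free for clarity).
defender = nn.Linear(d, d, bias=False)
# Adversary hypernetwork: pooled activations -> flattened LoRA factors A and B.
adversary = nn.Linear(d, 2 * d * rank)

opt_def = torch.optim.Adam(defender.parameters(), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)

def lora_delta(acts: torch.Tensor) -> torch.Tensor:
    """Reshape the hypernetwork output into a low-rank update A @ B."""
    ab = adversary(acts.mean(dim=0, keepdim=True))
    A = ab[:, : d * rank].view(d, rank)
    B = ab[:, d * rank :].view(rank, d)
    D = A @ B
    # Normalize so the adversary cannot win by scaling alone
    # (a stabilizing assumption of this toy, not necessarily the paper's constraint).
    return D / D.norm().clamp_min(1e-8)

def safety_loss(weight: torch.Tensor, x: torch.Tensor, safe_target: torch.Tensor) -> torch.Tensor:
    """Toy proxy for 'keep refusing harmful prompts' under the given weights."""
    return F.mse_loss(x @ weight.T, safe_target)

x = torch.randn(16, d)            # stand-in for activations on harmful prompts
safe_target = torch.zeros(16, d)  # stand-in for the aligned (refusal) behaviour

for _ in range(200):
    # Inner step: the adversary learns a LoRA delta that maximizes harm,
    # i.e. minimizes the negative safety loss, with defender weights frozen.
    delta = lora_delta(x)
    adv_loss = -safety_loss(defender.weight.detach() + delta, x, safe_target)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # Outer step: the defender learns to nullify the (now frozen) delta,
    # staying close to the safe behaviour even with the tampered weights applied.
    delta = lora_delta(x).detach()
    def_loss = safety_loss(defender.weight + delta, x, safe_target)
    opt_def.zero_grad(); def_loss.backward(); opt_def.step()
```

In the real setting the "weight" would be attention or MLP matrices of an open-weight LLM, the safety loss would be a refusal or language-modeling objective on red-teaming data, and the defender objective would also include a capability term on benign data, as the key points below note.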

🔍 Key Points

  • Introduction of AntiDote: A novel bi-level optimization framework that incorporates an adversarial hypernetwork to enhance the tamper-resistance of large language models (LLMs).
  • Demonstration of robustness against diverse adversarial attacks: AntiDote is up to 27.4% more robust than existing tamper-resistance and unlearning baselines, while incurring a performance drop of less than 0.5% on utility benchmarks.
  • A decoupled training strategy that separates capability preservation from safety enhancement, allowing both objectives to be optimized concurrently and substantially narrowing the historical safety-utility trade-off (one plausible formalization is sketched after this list).
  • Extensive empirical validation across 52 red-teaming attack vectors, spanning jailbreak prompting, latent-space manipulation, and direct weight-space attacks, positioning AntiDote as a state-of-the-art defense against complex hijacking attempts.
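One plausible way to write the decoupled objective mentioned above is as a bi-level problem: the defender parameters are trained for capability on benign data and for safety under the worst-case hypernetwork delta. The notation below (λ, D_util, D_harm, Δ_φ, the loss names) is assumed for illustration rather than taken from the paper.

```latex
% Assumed bi-level formalization: \theta are defender parameters, \phi adversary parameters.
\min_{\theta}\;
  \mathcal{L}_{\mathrm{cap}}\bigl(\theta;\ \mathcal{D}_{\mathrm{util}}\bigr)
  \;+\; \lambda\,
  \mathcal{L}_{\mathrm{safe}}\bigl(\theta + \Delta_{\phi^{*}}(\theta);\ \mathcal{D}_{\mathrm{harm}}\bigr),
\qquad
\phi^{*} \in \arg\max_{\phi}\;
  \mathcal{L}_{\mathrm{safe}}\bigl(\theta + \Delta_{\phi}(\theta);\ \mathcal{D}_{\mathrm{harm}}\bigr)
```

Here Δ_φ(θ) denotes the LoRA update emitted by the hypernetwork conditioned on the defender's activations, D_util and D_harm are benign and red-teaming data respectively, and λ trades off capability preservation against tamper resistance.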

💡 Why This Paper Matters

The research presented in this paper is critical as it addresses the growing security concerns surrounding large language models, especially in light of their widespread deployment in various applications. By demonstrating a robust methodology to cultivate inherent resilience to adversarial attacks, this work not only contributes to the theoretical understanding of LLM vulnerabilities but also provides practical solutions for building safer AI systems.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are highly pertinent to AI security researchers, as they outline a proactive framework for developing secure, tamper-resistant LLMs. The introduction of AntiDote and its validation against a comprehensive suite of adversarial attacks can inform future work on AI safety, providing methodologies to defend against evolving threats to AI systems. This paper lays the groundwork for further research into resilience and safety mechanisms that are crucial for maintaining the integrity of AI technologies.

📚 Read the Full Paper