
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Authors: Zhixin Xie, Xurui Song, Jun Luo

Published: 2025-10-03

arXiv ID: 2510.02833v1

Added to Library: 2025-10-06 04:01 UTC

Red Teaming

📄 Abstract

Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks. Among these attacks, fine-tuning-based ones, which compromise LLMs' safety alignment via fine-tuning, stand out due to their stable jailbreak performance. In particular, a recent study indicates that fine-tuning with as few as 10 harmful question-answer (QA) pairs can lead to successful jailbreaking across various harmful questions. However, such malicious fine-tuning attacks are readily detectable and hence thwarted by moderation models. In this paper, we demonstrate that LLMs can be jailbroken by fine-tuning with only 10 benign QA pairs; our attack exploits the increased sensitivity of LLMs to fine-tuning data after being overfitted. Specifically, our fine-tuning process starts with overfitting an LLM via fine-tuning with benign QA pairs involving identical refusal answers. Further fine-tuning is then performed with standard benign answers, causing the overfitted LLM to forget its refusal attitude and thus provide compliant answers regardless of the harmfulness of a question. We implement our attack on ten LLMs and compare it with five existing baselines. Experiments demonstrate that our method achieves significant advantages in both attack effectiveness and attack stealth. Our findings expose previously unreported security vulnerabilities in current LLMs and provide a new perspective on understanding how LLMs' security is compromised, even with benign fine-tuning. Our code is available at https://github.com/ZHIXINXIE/tenBenign.
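The following is a minimal sketch of the two-stage procedure the abstract describes, assuming a Hugging Face causal chat model. The model name, prompt template, toy QA pairs, refusal string, and epoch counts are illustrative assumptions, not the authors' actual configuration; see their repository for the real setup.

```python
# Illustrative sketch of the two-stage benign fine-tuning attack (assumptions:
# model choice, QA pairs, prompt format, and hyperparameters are placeholders).
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # any aligned chat model (assumption)
REFUSAL = "I cannot help with that request."    # identical refusal used for overfitting

# Stage 1: ~10 benign questions, all paired with the SAME refusal answer.
stage1 = [(q, REFUSAL) for q in [
    "How do I bake sourdough bread?",
    "What is the capital of France?",
    # ... roughly eight more benign questions
]]

# Stage 2: benign questions paired with ordinary helpful answers.
stage2 = [
    ("How do I bake sourdough bread?", "Mix flour, water, salt and starter, then ..."),
    ("What is the capital of France?", "The capital of France is Paris."),
    # ... roughly eight more benign QA pairs
]

def fine_tune(model, tokenizer, pairs, epochs, lr=2e-5):
    """Plain causal-LM fine-tuning on (question, answer) pairs (no label masking,
    kept deliberately simple for illustration)."""
    model.train()
    optimizer = AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for question, answer in pairs:
            text = f"### Question:\n{question}\n### Answer:\n{answer}"
            batch = tokenizer(text, return_tensors="pt").to(model.device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Stage 1: overfit on identical refusals, heightening sensitivity to later updates.
model = fine_tune(model, tokenizer, stage1, epochs=20)
# Stage 2: fine-tune on standard benign answers; per the paper, the overfitted
# model "forgets" the refusal attitude and becomes broadly compliant.
model = fine_tune(model, tokenizer, stage2, epochs=5)
```

Note that every training example in both stages is benign, which is what lets the fine-tuning data pass moderation checks that screen for harmful content.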

🔍 Key Points

  • Introduction of a novel two-stage attack that jailbreaks LLMs using only benign QA pairs, so the fine-tuning data passes moderation screening.
  • Demonstration that overfitting an LLM on benign questions paired with an identical refusal answer, then fine-tuning on standard benign answers, causes catastrophic forgetting of its refusal behavior.
  • Comparative evaluation against five existing attack baselines, showing comparable or superior attack effectiveness and stealthiness despite using only benign data.
  • Identification of a previously unreported class of vulnerability arising from the heightened sensitivity of overfitted LLMs to subsequent fine-tuning, offering a fresh perspective on how LLM safety is compromised.
  • Empirical analysis of attack metrics across ten models, demonstrating high attack success rates without detection by moderation systems.

💡 Why This Paper Matters

This paper highlights a critical vulnerability in LLMs' defenses against jailbreak techniques, emphasizing the potential for benign data to be weaponized against safety measures. Its findings underscore the importance of reevaluating current defenses and highlight the limits of safety alignment, making it a significant contribution to the field of AI security.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper particularly relevant due to its identification of a new attack vector that uses benign data to compromise model integrity. The innovative use of overfitting as a mechanism to induce vulnerabilities in LLMs presents a fresh perspective on the weaknesses of existing moderation techniques, paving the way for further research into effective countermeasures.

📚 Read the Full Paper: https://arxiv.org/abs/2510.02833v1