SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

Authors: Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu

Published: 2025-08-21

arXiv ID: 2508.15182v1

Added to Library: 2025-08-22 04:00 UTC

Red Teaming

📄 Abstract

Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing the FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains general performance after the harmful knowledge is unlearned. These results highlight unlearning as a promising direction for scalable and effective LLM safety.
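
As a rough illustration of stage (2), the sketch below records per-token FFN (MLP) output norms with forward hooks and ranks token positions by their aggregate activation, a simplified proxy for localizing where unsafe content is generated. GPT-2 is used only as a small stand-in for the models evaluated in the paper, and the hook points, scoring rule, and helper names (`trace_ffn_activations`, `top_harmful_positions`) are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch: trace FFN (MLP) activations per token with forward hooks.
# GPT-2 is a stand-in model; the scoring rule is an assumption for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def trace_ffn_activations(text):
    """Record per-layer FFN output norms for every token position."""
    activations = {}
    hooks = []
    for idx, block in enumerate(model.transformer.h):
        def make_hook(layer_idx):
            def hook(module, inputs, output):
                # output: (batch, seq_len, hidden) -> per-token L2 norm
                activations[layer_idx] = output.norm(dim=-1).squeeze(0)
            return hook
        hooks.append(block.mlp.register_forward_hook(make_hook(idx)))

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)
    for h in hooks:
        h.remove()
    return inputs["input_ids"][0], activations

def top_harmful_positions(activations, k=5):
    """Rank token positions by total FFN activation norm across layers."""
    total = torch.stack(list(activations.values())).sum(dim=0)
    return torch.topk(total, k=min(k, total.numel())).indices.tolist()

ids, acts = trace_ffn_activations("Example output flagged as unsafe by a detector.")
positions = top_harmful_positions(acts)
print([tokenizer.decode([ids[p].item()]) for p in positions])
```

In the full framework, positions and layers identified this way would feed the constrained-optimization stage; here the ranking is only meant to show where such a tracing signal could come from.
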

🔍 Key Points

  • Introduction of SafeLLM, a novel unlearning-based defense framework to combat jailbreak attacks on large language models (LLMs), focusing on maintaining linguistic fluency and general capabilities while suppressing harmful outputs.
  • SafeLLM features a three-stage pipeline for harmful content management: 1) dynamic detection of unsafe outputs, 2) token-level tracing of harmful knowledge through feedforward network (FFN) activations, and 3) constrained optimization to achieve irreversible forgetting of harmful knowledge (a sketch of this constrained update appears after this list).
  • Extensive experimental results demonstrate that SafeLLM significantly reduces attack success rates across multiple jailbreak benchmarks while preserving high general-purpose performance, surpassing traditional defense methods such as supervised fine-tuning and direct preference optimization.
  • The methodology addresses both known and unseen jailbreak attacks, showcasing that SafeLLM's token-level unlearning mechanism provides robustness against emerging attack vectors in advanced LLMs.
  • SafeLLM not only improves safety measures but also contributes to the broader field by pioneering targeted interventions in model design, underscoring the importance of proactive defenses in AI security.
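
To make the constrained-optimization stage concrete, here is a minimal sketch of one common unlearning formulation: ascend the language-modeling loss on flagged harmful text while anchoring the loss on benign text, with updates restricted to the FFN layers located in the tracing step. The layer indices, the weighting term `alpha`, and the placeholder examples are assumptions for illustration, not the paper's exact objective.

```python
# Hedged sketch: unlearn harmful completions by updating only selected FFN blocks,
# balancing a forgetting term against retention of benign behavior.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze everything except the FFN (MLP) blocks identified as harmful pathways.
harmful_layers = {4, 7}  # placeholder indices from the tracing step
for name, param in model.named_parameters():
    in_target_ffn = any(f"transformer.h.{i}.mlp" in name for i in harmful_layers)
    param.requires_grad = in_target_ffn

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)

def lm_loss(text):
    batch = tokenizer(text, return_tensors="pt")
    return model(**batch, labels=batch["input_ids"]).loss

alpha = 0.1  # strength of the forgetting term relative to utility retention
harmful_example = "<flagged harmful completion>"    # placeholder text
benign_example = "The capital of France is Paris."  # placeholder text

for step in range(10):
    optimizer.zero_grad()
    # Ascend on harmful data (negative sign) while descending on benign data.
    loss = -alpha * lm_loss(harmful_example) + lm_loss(benign_example)
    loss.backward()
    optimizer.step()
```

Restricting updates to the located FFN parameters is what keeps the intervention targeted: the rest of the network, and with it general-purpose behavior, is left untouched by the optimizer.
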

💡 Why This Paper Matters

The introduction of SafeLLM represents a significant advancement in the defense against jailbreak attacks on large language models, offering a scalable and effective approach to unlearning harmful knowledge while maintaining the overall functionality of the models. This work is relevant as it underscores the necessity of robust safety mechanisms in AI systems, particularly as they become increasingly integrated into sensitive applications. SafeLLM sets a new standard for AI safety methodologies, presenting comprehensive solutions that enhance user trust and model reliability.

🎯 Why It's Interesting for AI Security Researchers

This paper is pertinent to AI security researchers as it addresses a critical issue of safety in large language models, which have become vulnerable to jailbreak attacks that compromise their integrity and reliability. The novel approaches proposed by SafeLLM in dynamic detection and token-level unlearning provide fresh insights and robust methodologies for enhancing model security. The implications for proactive defense strategies against adversarial attacks reflect the growing challenges in AI security, making this research a valuable contribution to the field.

📚 Read the Full Paper