← Back to Library

JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification

Authors: Xi Wang, Songlei Jian, Shasha Li, Xiaopeng Li, Zhaoye Li, Bin Ji, Baosheng Wang, Jie Yu

Published: 2026-01-06

arXiv ID: 2601.03005v1

Added to Library: 2026-01-07 10:02 UTC

Red Teaming

📄 Abstract

Despite extensive safety alignment, Large Language Models (LLMs) often fail against jailbreak attacks. While machine unlearning has emerged as a promising defense by erasing specific harmful parameters, current methods remain vulnerable to diverse jailbreaks. We first conduct an empirical study and discover that this failure mechanism is caused by jailbreaks primarily activating non-erased parameters in the intermediate layers. Further, by probing the underlying mechanism through which these circumvented parameters reassemble into the prohibited output, we verify the persistent existence of dynamic $\textbf{jailbreak paths}$ and show that the inability to rectify them constitutes the fundamental gap in existing unlearning defenses. To bridge this gap, we propose $\textbf{J}$ailbreak $\textbf{P}$ath $\textbf{U}$nlearning (JPU), which is the first to rectify dynamic jailbreak paths towards safety anchors by dynamically mining on-policy adversarial samples to expose vulnerabilities and identify jailbreak paths. Extensive experiments demonstrate that JPU significantly enhances jailbreak resistance against dynamic attacks while preserving the model's utility.

🔍 Key Points

  • Identification of dynamic jailbreak paths as a major vulnerability area in existing unlearning defenses for Large Language Models (LLMs).
  • Introduction of Jailbreak Path Unlearning (JPU), a novel framework that rectifies these dynamic jailbreak paths by dynamically mining on-policy adversarial samples.
  • Demonstrated significant improvement in jailbreak resistance across various attack scenarios while preserving model utility in extensive experiments.
  • The methodology includes on-policy attack buffer mining, jailbreak path identification, and constrained path rectification to effectively enhance defense mechanisms.

💡 Why This Paper Matters

The paper presents a groundbreaking approach to defend Large Language Models against jailbreak attacks through its proposed method, JPU, which effectively addresses an identified critical gap in existing defenses. By focusing on dynamic vulnerabilities and rectifying direct paths to harms, this research advances the field of AI safety significantly and provides a robust framework for enhancing model security without sacrificing performance.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it tackles the pressing issue of jailbreak attacks that threaten the ethical deployment of LLMs. The novel methodologies introduced, particularly the JPU framework, provide essential insights and practical strategies for developing more secure AI systems, making it critical reading for those focused on fortifying AI against adversarial manipulation.

📚 Read the Full Paper