Dual-Space Smoothness for Robust and Balanced LLM Unlearning

Authors: Han Yan, Zheyuan Liu, Meng Jiang

Published: 2025-09-27

arXiv ID: 2509.23362v1

Added to Library: 2025-09-30 04:03 UTC

Red Teaming

📄 Abstract

With the rapid advancement of large language models, Machine Unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety. Yet state-of-the-art (SOTA) unlearning methods often suffer from catastrophic forgetting and metric imbalance, for example by over-optimizing one objective (e.g., unlearning effectiveness, utility preservation, or privacy protection) at the expense of others. In addition, small perturbations in the representation or parameter space can be exploited by relearn and jailbreak attacks. To address these challenges, we propose PRISM, a unified framework that enforces dual-space smoothness in representation and parameter spaces to improve robustness and balance unlearning metrics. PRISM consists of two smoothness optimization stages: (i) a representation space stage that employs a robustly trained probe to defend against jailbreak attacks, and (ii) a parameter-space stage that decouples retain-forget gradient conflicts, reduces imbalance, and smooths the parameter space to mitigate relearning attacks. Extensive experiments on WMDP and MUSE, across conversational-dialogue and continuous-text settings, show that PRISM outperforms SOTA baselines under multiple attacks while achieving a better balance among key metrics.
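The abstract's parameter-space stage smooths the loss landscape so that small weight perturbations (as exploited by relearning attacks) do not recover forgotten knowledge. The paper's exact procedure is not reproduced here; the sketch below illustrates the general idea with a sharpness-aware (SAM-style) update, where the gradient is taken at an adversarially perturbed point. All names (`sam_step`, `rho`, `lr`) and the choice of SAM are illustrative assumptions, not PRISM's actual algorithm.

```python
import numpy as np

def sam_step(w, loss_grad, rho=0.05, lr=0.1):
    """One sharpness-aware update: perturb the weights along the ascent
    direction by radius rho, then descend using the gradient measured at
    the perturbed point. This penalizes sharp minima, encouraging a
    smooth parameter space. (Illustrative sketch, not the paper's method.)"""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case perturbation
    g_sharp = loss_grad(w + eps)                 # gradient at perturbed point
    return w - lr * g_sharp

# Example with a quadratic loss L(w) = 0.5 * ||w||^2, so grad L(w) = w:
w_new = sam_step(np.array([1.0, 0.0]), lambda w: w)
```

With the quadratic example the perturbed gradient is slightly larger than the plain gradient, so the SAM step moves a little farther than ordinary gradient descent would; on sharp, curved losses this difference steers updates toward flatter regions.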

🔍 Key Points

  • Introduction of PRISM, a novel unlearning framework that employs dual-space smoothness in representation and parameter spaces to enhance robustness and balance among unlearning metrics.
  • Demonstration that current state-of-the-art (SOTA) methods suffer from catastrophic forgetting and trade unlearning effectiveness, utility preservation, and robustness off against one another.
  • Extensive experimental validation shows PRISM significantly outperforms existing approaches under various attacks (relearning and jailbreak) while maintaining a better balance of key metrics across conversational-dialogue and continuous-text settings.
  • PRISM utilizes a min-max optimization approach to decouple retain-forget gradient conflicts, promoting robust performance against adversarial attacks.
  • Ablation studies confirm that all components of PRISM contribute significantly to its performance, indicating the critical balance needed between forgetting knowledge and preserving utility.
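The retain-forget gradient decoupling mentioned above can be illustrated as a gradient-surgery step: when the forget-set gradient opposes the retain-set gradient, the conflicting component is projected out before the update. This is a minimal sketch under assumed details (a PCGrad-style projection; the function name and formulation are hypothetical), not PRISM's actual min-max procedure.

```python
import numpy as np

def decouple_gradients(g_forget: np.ndarray, g_retain: np.ndarray) -> np.ndarray:
    """Remove from the forget gradient any component that conflicts with
    the retain gradient (illustrative PCGrad-style projection)."""
    dot = float(np.dot(g_forget, g_retain))
    if dot < 0:  # gradients conflict: this forgetting step would hurt retention
        g_forget = g_forget - (dot / (np.dot(g_retain, g_retain) + 1e-12)) * g_retain
    return g_forget

# A conflicting pair: the raw forget gradient points against retention.
g = decouple_gradients(np.array([-1.0, 0.5]), np.array([1.0, 1.0]))
# After projection the forget gradient no longer opposes the retain gradient.
```

The appeal of this kind of decoupling is that the unlearning step can be made aggressive on forget-set knowledge without silently degrading retained capabilities, which is one plausible reading of how PRISM reduces metric imbalance.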

💡 Why This Paper Matters

This paper presents a comprehensive approach to resolving critical challenges in Machine Unlearning for large language models. PRISM enhances the robustness of language models against unlearning-related threats while striking a better balance between unlearning effectiveness and model utility. This dual-space smoothness approach addresses significant gaps in existing methods, providing a pathway for safer and more efficient handling of sensitive data in AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant for its focus on hardening large language models against unlearning vulnerabilities such as relearning and jailbreak attacks. The proposed framework not only strengthens the security of AI models but also paves the way for safer data-handling practices that comply with privacy regulations.
