
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs

Authors: Shuang Ao, Yi Dong, Jinwei Hu, Sarvapali Ramchurn

Published: 2025-06-21

arXiv ID: 2506.18931v1

Added to Library: 2025-06-25 04:01 UTC

Safety

📄 Abstract

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enhances adaptability while reducing computational costs. However, fine-tuning can compromise safety alignment, even with benign data, increasing susceptibility to harmful outputs. Existing safety alignment methods struggle to capture complex parameter shifts, leading to suboptimal safety-utility trade-offs. To address this issue, we propose Safe Pruning LoRA (SPLoRA), a novel pruning-based approach that selectively removes LoRA layers that weaken safety alignment, improving safety while preserving performance. At its core, we introduce Empirical-DIEM (E-DIEM), a dimension-insensitive similarity metric that effectively detects safety misalignment in LoRA-adapted models. We conduct extensive experiments on LLMs fine-tuned with a mix of benign and malicious data, as well as purely benign datasets, evaluating SPLoRA across utility, safety, and reliability metrics. Results demonstrate that SPLoRA outperforms state-of-the-art safety alignment techniques, significantly reducing safety risks while maintaining or improving model performance and reliability. Additionally, SPLoRA reduces inference overhead, making it a scalable and efficient solution for deploying safer and more reliable LLMs. The code is available at https://github.com/AoShuang92/SPLoRA.
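To make the pruning idea concrete, below is a minimal sketch of how a layer-level pruning pass could look in a PyTorch/PEFT-style setup. The names (`prune_unsafe_lora_layers`, `safety_activations`, `score_fn`) and the zero-out strategy are illustrative assumptions, not the paper's exact implementation; the core idea is to score each LoRA-adapted layer's drift on safety-critical inputs and disable adapters whose drift crosses a threshold.

```python
# Hedged sketch of layer-wise LoRA pruning for safety alignment, assuming a
# PyTorch/PEFT-style model. `safety_activations` and `score_fn` are hypothetical
# inputs, not the paper's API.
import torch


def prune_unsafe_lora_layers(model, safety_activations, score_fn, threshold):
    """Zero out LoRA adapters on layers whose drift score exceeds `threshold`.

    `safety_activations` maps layer name -> (base_act, adapted_act) tensors collected
    by running safety-critical prompts through the base and LoRA-adapted models.
    `score_fn` is a distance/similarity metric (e.g., an E-DIEM-style score).
    """
    pruned = []
    for name, module in model.named_modules():
        # PEFT LoRA layers expose their low-rank factors as `lora_A` / `lora_B`.
        if not (hasattr(module, "lora_A") and hasattr(module, "lora_B")):
            continue
        base_act, adapted_act = safety_activations[name]
        score = score_fn(base_act, adapted_act)
        if score > threshold:  # large drift on safety inputs -> prune this adapter
            for factor in (module.lora_A, module.lora_B):
                for param in factor.parameters():
                    torch.nn.init.zeros_(param)  # disable the low-rank update in place
            pruned.append(name)
    return pruned
```

Zeroing the low-rank factors removes a layer's LoRA contribution without touching the frozen base weights, which is one simple way to realize "pruning a LoRA layer" while also avoiding the extra adapter computation at inference time.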

🔍 Key Points

  • Safe Pruning LoRA (SPLoRA) is proposed as a novel approach to improve safety alignment in fine-tuned LLMs by selectively pruning LoRA layers that weaken safety, thereby enhancing model reliability and performance.
  • The Empirical-Dimension Insensitive Euclidean Metric (E-DIEM) enables effective detection of safety misalignment in LoRA layers, overcoming the limitations of traditional similarity measures in high-dimensional spaces (see the sketch after this list).
  • Extensive experiments demonstrate that SPLoRA outperforms existing state-of-the-art safety alignment techniques, achieving significant reductions in safety risks while maintaining or improving utility metrics like ROUGE and METEOR.
  • The technique not only improves safety without compromising performance but also reduces inference overhead, making it a more efficient solution for deploying robust LLMs.
  • The approach pairs a theoretically grounded distance metric with a practical pruning procedure, yielding safer AI systems while preserving the efficiency benefits of LoRA.
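For illustration, here is a hedged sketch of a dimension-insensitive, empirically normalized Euclidean score in the spirit of E-DIEM. The centering-and-scaling scheme and the helper name `empirical_diem` are assumptions; the paper's exact formulation may differ. The point is that a raw Euclidean distance is recalibrated against an empirical reference distribution, so scores remain comparable across layers of different dimensionality.

```python
# Hedged sketch of an empirically normalized, dimension-insensitive Euclidean score.
# The normalization (center by empirical mean distance, scale by empirical std) is an
# assumption in the spirit of E-DIEM, not the paper's verified formula.
import torch


def empirical_diem(x, y, reference_pairs):
    """Return a normalized distance between vectors x and y.

    `reference_pairs` is an iterable of (a, b) tensor pairs drawn from the same layer,
    used to estimate the expected distance and its spread at this dimensionality.
    """
    ref = torch.stack([torch.linalg.vector_norm(a - b) for a, b in reference_pairs])
    mean, std = ref.mean(), ref.std().clamp_min(1e-8)
    d = torch.linalg.vector_norm(x - y)
    return (d - mean) / std  # comparable across layers regardless of dimension
```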

💡 Why This Paper Matters

The paper presents Safe Pruning LoRA, a method that addresses a critical challenge in adapting Large Language Models: maintaining safety alignment during fine-tuning. By isolating and pruning the LoRA layers that compromise safety, this research provides a framework that enhances both the safety and performance of LLMs, which are increasingly deployed in sensitive applications. This work marks a step forward in designing responsible AI systems that meet safety standards while retaining adaptability and efficiency.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is particularly important because it tackles the intricate balance between model utility and safety—a pressing concern in deploying AI systems in the real world. The novel E-DIEM metric provides a rigorous approach to assessing safety risks associated with fine-tuning, thus contributing to methodologies aimed at mitigating harmful outputs from LLMs. Furthermore, the results demonstrate practical techniques that could be adopted in the construction of safer AI models, making it a significant contribution to the field of AI security.

📚 Read the Full Paper