
Understanding and Preserving Safety in Fine-Tuned LLMs

Authors: Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, Ruoxi Jia

Published: 2026-01-15

arXiv ID: 2601.10141v1

Added to Library: 2026-01-16 03:03 UTC

Red Teaming Safety

📄 Abstract

Fine-tuning is an essential and pervasive functionality for applying large language models (LLMs) to downstream tasks. However, it can substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Despite growing attention to defenses applied during the fine-tuning stage, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to a steep decline in safety. In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank subspace, while utility gradients span a broader high-dimensional space; (II) these subspaces are often negatively correlated, causing directional conflicts during fine-tuning; and (III) the dominant safety direction can be efficiently estimated from a single sample. Building on these insights, we propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace. Theoretically, we show that SPF guarantees utility convergence while bounding safety drift. Empirically, SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios. Furthermore, SPF exhibits robust resistance to both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning.
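
As a concrete illustration of insight (III), the sketch below estimates a dominant safety direction from the gradient of a safety loss on a single sample. This is a minimal reading of the idea, not the authors' code: the model is assumed to follow the Hugging Face convention of returning a .loss when labels are supplied, and estimate_safety_direction and its arguments are illustrative names rather than the paper's API.

    import torch

    def estimate_safety_direction(model, input_ids, labels):
        """Return a unit-norm, flattened gradient of the safety loss on one sample."""
        model.zero_grad()
        # Safety loss on a single refusal-style example (hypothetical inputs).
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        # Concatenate all parameter gradients into one direction vector.
        g = torch.cat([p.grad.detach().flatten()
                       for p in model.parameters() if p.grad is not None])
        return g / (g.norm() + 1e-12)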

🔍 Key Points

  • Identification of the safety-utility dilemma in fine-tuning LLMs, where prioritizing one often compromises the other.
  • Discovery that safety gradients reside in a low-rank subspace, while utility gradients span a higher-dimensional space, leading to directional conflicts during fine-tuning.
  • Introduction of Safety-Preserving Fine-tuning (SPF), which decouples utility updates from safety degradation via a projection mechanism that removes gradient components conflicting with the safety subspace (a sketch of this idea appears after this list).
  • Demonstration through experiments that SPF maintains high task performance while nearly recovering pre-trained safety alignment, even under adversarial conditions.
  • Establishment of a theoretical framework that guarantees utility convergence while providing bounds on safety drift.
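
The projection mechanism can be read, under assumptions, as stripping from the utility gradient the component that lies in the estimated low-rank safety subspace before each optimizer step. The sketch below shows that simple orthogonal-complement projection; the paper's actual rule removes only conflicting components and may be more selective, and safety_basis is assumed to hold orthonormal rows spanning the subspace (e.g., built from single-sample safety gradients as sketched above).

    import torch

    def remove_safety_conflict(utility_grad: torch.Tensor,
                               safety_basis: torch.Tensor) -> torch.Tensor:
        """
        utility_grad: flattened task-loss gradient, shape (d,)
        safety_basis: orthonormal rows spanning the safety subspace, shape (k, d)
        Returns the utility gradient projected onto the orthogonal complement of
        the safety subspace, so the update cannot push along safety directions.
        """
        coeffs = safety_basis @ utility_grad           # coordinates in the subspace, shape (k,)
        return utility_grad - safety_basis.T @ coeffs  # orthogonal-complement projection

In a fine-tuning loop, a projection like this would be applied to the flattened task gradient just before the optimizer step, with the result scattered back into the per-parameter gradients.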

💡 Why This Paper Matters

This paper addresses an urgent need: preserving safety alignment in fine-tuned large language models (LLMs), whose alignment can degrade sharply during fine-tuning, even on harmless data. By presenting the SPF method and demonstrating its effectiveness, the research provides a practical way to retain both safety and utility in real-world applications of LLMs, offering a path to keeping aligned models aligned in high-stakes settings where safety is paramount.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper of great interest due to its focus on mitigating safety risks associated with LLM fine-tuning. The identified vulnerabilities and the proposed SPF method offer significant advancements in securing LLMs against adversarial attacks, thereby contributing to the development of more reliable AI systems. The intersection of safety alignment and model performance is particularly relevant for those working on AI accountability and ethical AI deployment.

📚 Read the Full Paper