
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Authors: Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

Published: 2025-06-10

arXiv ID: 2506.08473v1

Added to Library: 2025-06-11 04:00 UTC

Safety

📄 Abstract

Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or even seemingly harmless data can compromise their safeguards. Building on the concept of the alignment direction -- defined by the weight difference between an aligned model and its unaligned counterpart -- we observe that perturbations along this direction preserve model safety, whereas perturbations along orthogonal directions rapidly degrade it, indicating that aligned models occupy a narrow safety basin in parameter space. Based on this insight, we propose AsFT (Anchoring Safety in Fine-Tuning), a safety fine-tuning method that adds a regularization term to the training objective. Using the alignment direction as an anchor, this term suppresses updates along harmful directions so that fine-tuning stays within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60 percent and improving model performance by 3.44 percent, while remaining robust across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT
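To make the abstract's description of the regularizer concrete, the objective plausibly takes a form like the following. This is a hedged reconstruction, not the paper's exact formula: the symbols d (alignment direction), ΔW (fine-tuning update), λ, and the decomposition into parallel and orthogonal components are assumed notation.

```latex
% Hedged sketch of an AsFT-style regularized objective (assumed notation, not the paper's exact formula).
\[
  d = W_{\text{aligned}} - W_{\text{unaligned}}, \qquad
  \Delta W_{\perp} = \Delta W - \frac{\langle \Delta W, d \rangle_F}{\lVert d \rVert_F^{2}}\, d,
\]
\[
  \mathcal{L}_{\text{AsFT}} \;\approx\; \mathcal{L}_{\text{task}} \;+\; \lambda \,\lVert \Delta W_{\perp} \rVert_F^{2},
\]
```

Intuitively, the component of the update parallel to the alignment direction is left free, while the orthogonal component, which the authors associate with harmful directions, is penalized.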

🔍 Key Points

  • The paper introduces AsFT (Anchoring Safety in Fine-Tuning), a new methodology that constrains perturbations during LLM fine-tuning within a narrow safety basin, preserving the model's safety.
  • AsFT uses the alignment direction as an anchor to guide updates and suppress harmful perturbations (a minimal code sketch follows this list), achieving a notable reduction in harmful behaviors compared to existing methods.
  • Extensive experimental results demonstrate that AsFT outperforms Safe LoRA by reducing harmful outputs by 7.60% while also enhancing model performance by 3.44%.
  • The authors conduct thorough analyses of the safety landscape, identifying the dynamics of safety-related parameter updates and establishing a mathematical framework for understanding safety basins.
  • The methodology is validated across various datasets and experimental setups, highlighting its robustness and practical applicability in real-world scenarios.
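
The sketch below shows, in PyTorch-style Python, how a penalty on the orthogonal component of a weight update (as in the formula above) could be computed. Names such as `alignment_direction_penalty`, `w_aligned`, `w_unaligned`, and `lambda_reg` are illustrative assumptions, not the authors' implementation; see the official repository for the actual code.

```python
import torch

def alignment_direction_penalty(delta_w: torch.Tensor,
                                w_aligned: torch.Tensor,
                                w_unaligned: torch.Tensor) -> torch.Tensor:
    """Penalize the part of a weight update that leaves the alignment direction.

    delta_w:     fine-tuning update for one weight matrix (e.g. a LoRA product B @ A)
    w_aligned:   weights of the safety-aligned model
    w_unaligned: weights of the corresponding unaligned (base) model
    """
    # Alignment direction: weight difference between aligned and unaligned models.
    d = (w_aligned - w_unaligned).flatten()
    d = d / (d.norm() + 1e-8)

    # Split the update into components parallel and orthogonal to the alignment direction.
    u = delta_w.flatten()
    parallel = torch.dot(u, d) * d
    orthogonal = u - parallel

    # Penalizing the orthogonal component keeps fine-tuning inside the "safety basin".
    return orthogonal.pow(2).sum()


# Usage sketch: add the penalty to the ordinary fine-tuning loss.
# loss = task_loss + lambda_reg * alignment_direction_penalty(lora_B @ lora_A,
#                                                             w_aligned, w_unaligned)
```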

💡 Why This Paper Matters

This paper presents a significant advance in the safety of large language models during fine-tuning by introducing the AsFT methodology, which effectively mitigates the risk of harmful outputs while maintaining or improving model performance. Its findings are pivotal given the increasing integration of LLMs in sensitive applications, where safety and ethical considerations are paramount.

🎯 Why It's Interesting for AI Security Researchers

This research is directly relevant to AI security researchers because it tackles the pressing issue of safety during fine-tuning of large language models, a vulnerable stage where even slight perturbations can lead to catastrophic failures. The proposed AsFT methodology not only offers a promising approach to improving model resilience but also adds to the body of knowledge on aligning AI systems with ethical standards, making it a valuable resource for developing safer AI technologies.

📚 Read the Full Paper