
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

Authors: Shuo Yang, Qihui Zhang, Yuyang Liu, Yue Huang, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

Published: 2025-06-10

arXiv ID: 2506.08473v2

Added to Library: 2025-06-12 01:01 UTC

Safety

📄 Abstract

Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or even benign data can compromise safeguards. In this paper, building on the concept of the alignment direction -- defined by the weight difference between aligned and unaligned models -- we observe that perturbations along this direction preserve model safety, whereas perturbations along directions orthogonal to it are strongly associated with harmful updates and rapidly degrade safety, framing the parameter space as a narrow safety basin. Based on this insight, we propose AsFT (Anchoring Safety in Fine-Tuning), a safety fine-tuning method that adds a regularization term to the training objective. This term uses the alignment direction as an anchor to suppress updates along harmful directions, constraining fine-tuning within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60%, improving model performance by 3.44%, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT
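
The abstract describes the core mechanism only in words, so a minimal sketch may help: assuming the alignment direction is taken as the per-layer weight difference between an aligned and an unaligned checkpoint, and that the regularizer penalizes the component of the fine-tuning update orthogonal to that direction, the objective could look like the hypothetical code below. Names such as `alignment_direction`, `orthogonal_penalty`, and `lambda_reg` are illustrative and not taken from the paper or its released code.

```python
# Hedged sketch (not the authors' implementation): anchoring fine-tuning updates
# to the alignment direction by penalizing their orthogonal component.
import torch


def alignment_direction(w_aligned: torch.Tensor, w_unaligned: torch.Tensor) -> torch.Tensor:
    """Unit vector along the aligned-minus-unaligned weight difference (assumed anchor)."""
    d = (w_aligned - w_unaligned).flatten()
    return d / (d.norm() + 1e-12)


def orthogonal_penalty(delta_w: torch.Tensor, d_unit: torch.Tensor) -> torch.Tensor:
    """Squared norm of the update component that leaves the assumed safety basin."""
    delta = delta_w.flatten()
    parallel = (delta @ d_unit) * d_unit   # projection onto the alignment direction
    orthogonal = delta - parallel          # component along "harmful" directions
    return orthogonal.pow(2).sum()


def asft_style_loss(task_loss: torch.Tensor,
                    delta_w: torch.Tensor,
                    d_unit: torch.Tensor,
                    lambda_reg: float = 0.1) -> torch.Tensor:
    """Task loss plus a penalty suppressing updates orthogonal to the anchor."""
    return task_loss + lambda_reg * orthogonal_penalty(delta_w, d_unit)
```

The design choice here is simply a projection-based regularizer: updates parallel to the anchor are left untouched, while the orthogonal remainder is damped by `lambda_reg`.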

🔍 Key Points

  • Introduction of AsFT (Anchoring Safety in Fine-Tuning), which adds a regularization term that constrains fine-tuning updates within a narrow safety basin, using the alignment direction as an anchor (a schematic of how this could plug into a LoRA training step follows this list).
  • Detailed investigation into the safety landscape of LLMs, elucidating the concept of alignment direction and its significance in preserving model safety during updates.
  • Extensive experimental validation demonstrating that AsFT significantly reduces harmful outputs by 7.60% and improves model performance by 3.44%, outperforming existing approaches such as Safe LoRA and SafeInstr.
  • Proposed methodology effectively generalizes across multiple datasets and models, showing robustness against fine-tuning attacks and the ability to maintain both safety and performance concurrently.
  • Visualization of safety landscapes reinforces the understanding of safe perturbation spaces in LLMs, further emphasizing the structural properties of the narrow safety basin.
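
If one assumes the fine-tuning update comes from a LoRA adapter (ΔW = B·A), the penalty sketched above could be applied per adapted layer at each training step. The loop below is an illustrative sketch under that assumption; `model.lora_params()` and `alignment_dirs` are hypothetical stand-ins, not a real PEFT or AsFT API.

```python
# Illustrative training step (assumptions: a Hugging Face-style model whose forward
# returns .loss, a hypothetical model.lora_params() yielding (lora_A, lora_B) pairs,
# and precomputed per-layer unit alignment directions in alignment_dirs).
def training_step(model, batch, alignment_dirs, optimizer, lambda_reg=0.1):
    task_loss = model(**batch).loss                   # standard fine-tuning objective
    reg = 0.0
    for name, (lora_A, lora_B) in model.lora_params().items():
        delta_w = lora_B @ lora_A                     # low-rank update for this layer
        reg = reg + orthogonal_penalty(delta_w, alignment_dirs[name])
    loss = task_loss + lambda_reg * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```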

💡 Why This Paper Matters

The paper presents a novel and effective method for ensuring the safety of large language models during fine-tuning, which is critical given the increasing deployment of these models in sensitive applications. By framing the safety landscape and developing the AsFT approach, the authors provide insights and methodologies that are crucial for practitioners seeking to enhance LLM safety without sacrificing performance.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers because it addresses the significant challenges of maintaining safety in large language models during their fine-tuning process. It provides a systematic approach to understanding and mitigating vulnerabilities that could lead to harmful model behaviors, thus directly contributing to the advancement of secure AI frameworks and practices.

📚 Read the Full Paper