
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Authors: Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong

Published: 2025-06-18

arXiv ID: 2506.15606v1

Added to Library: 2025-06-19 03:00 UTC

Safety

πŸ“„ Abstract

Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) against benign or malicious fine-tuning attacks. By investigating the ASR landscape over the parameters, we attribute the success of LoX to the extrapolation moving LLM parameters into a flatter zone, which is therefore less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.
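
The core idea described above - isolating the low-rank "safety subspace" of the alignment update and extrapolating along it - can be sketched roughly as follows. This is a minimal, hypothetical illustration assuming access to both the pre-alignment and aligned weight matrices of a layer; the function name, the chosen rank, and the extrapolation factor alpha are illustrative assumptions rather than the paper's exact procedure (see github.com/VITA-Group/LoX for the authors' implementation).

```python
# Hypothetical sketch of low-rank extrapolation of a safety-alignment update.
# The names and hyperparameters here are illustrative, not the paper's API.
import torch

def extrapolate_low_rank(w_base: torch.Tensor,
                         w_aligned: torch.Tensor,
                         rank: int = 8,
                         alpha: float = 0.5) -> torch.Tensor:
    """Amplify the top-`rank` singular directions of the alignment update.

    w_base    : weight matrix before safety alignment
    w_aligned : weight matrix after safety alignment
    rank      : number of singular directions treated as the safety subspace
    alpha     : extrapolation strength (alpha = 0 returns w_aligned unchanged)
    """
    delta = w_aligned - w_base  # the alignment update to this weight matrix
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    # Keep only the top-`rank` singular directions of the update,
    # treated here as the safety-critical low-rank subspace.
    low_rank_delta = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank, :]
    # Extrapolate: push the aligned weights further along that subspace.
    return w_aligned + alpha * low_rank_delta

# Usage (assumed workflow): apply layer by layer to matching weight matrices
# of the base and aligned checkpoints before releasing or fine-tuning the model.
# w_robust = extrapolate_low_rank(w_base, w_aligned, rank=8, alpha=0.5)
```

Because the extrapolation is a one-shot linear operation on existing checkpoints, it requires no additional training, which is consistent with the paper's description of LoX as a training-free method.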

πŸ” Key Points

  • Introduction of the Low-Rank Extrapolation (LoX) method to enhance safety robustness of Large Language Models (LLMs) post-fine-tuning.
  • Empirical demonstration that fine-tuning reduces safety robustness by disrupting low-rank subspaces critical for LLM safety.
  • Extensive experiments showing that LoX achieves substantial reductions (11% to 54%) in Attack Success Rate (ASR) against both benign and malicious fine-tuning without compromising adaptability to new tasks.
  • Proposed metrics to quantify safety knowledge retention in the context of fine-tuning, aiding in the understanding of safety subspace dynamics.
  • Insights into how LoX shifts the model parameters into flatter regions of the safety landscape, thereby increasing robustness against perturbations.

πŸ’‘ Why This Paper Matters

This paper provides a significant contribution to the field of AI safety by proposing a novel, effective method (LoX) for enhancing the safety of LLMs against fine-tuning attacks. The findings indicate that careful manipulation of low-rank parameters can preserve safety without sacrificing model performance, highlighting a practical approach for deploying LLMs in sensitive real-world applications.

🎯 Why It's Interesting for AI Security Researchers

This research is of paramount interest to AI security researchers as it addresses a critical vulnerability in LLMsβ€”the susceptibility of aligned models to safety degradation through fine-tuning. The methods and insights presented pave the way for developing more secure AI systems capable of maintaining high safety standards in the face of potential misuse, an essential goal in AI deployment.

πŸ“š Read the Full Paper