
Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Authors: Thong Bach, Thanh Nguyen-Tang, Dung Nguyen, Thao Minh Le, Truyen Tran

Published: 2025-11-22

arXiv ID: 2511.18039v1

Added to Library: 2025-11-25 03:01 UTC

Safety

📄 Abstract

Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes concerning harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs while preserving task-relevant performance, avoiding full reversion and enabling precise, low-impact updates. Extensive evaluations across multiple model families and adversarial settings show that our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
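
To make the abstract's ingredients concrete, the block below writes out the standard influence-function form (Koh & Liang, 2017) and a generic damped, curvature-preconditioned ascent step on a set of harmful inputs. This is illustrative notation from the wider literature, not the paper's own formulation, which may differ in its exact objective and Hessian approximation.

```latex
% Influence of upweighting a training point z on the loss at a test point z_test,
% with \hat{\theta} the fitted parameters and H_{\hat{\theta}} the empirical Hessian.
\mathcal{I}(z, z_{\mathrm{test}})
  = -\,\nabla_\theta \mathcal{L}(z_{\mathrm{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_\theta \mathcal{L}(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2}\, \mathcal{L}(z_i, \hat{\theta}).

% A damped, curvature-preconditioned ascent step on a harmful set D_h, illustrating
% "selectively increase loss on harmful inputs" while limiting movement along
% high-curvature (task-relevant) directions:
\theta \;\leftarrow\; \theta + \eta\, \bigl(H_{\theta} + \lambda I\bigr)^{-1}
                      \nabla_\theta \mathcal{L}(\mathcal{D}_h, \theta).
```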

🔍 Key Points

  • The authors reveal that the geometric structure of the loss landscape concerning harmful content remains preserved in fine-tuned large language models (LLMs), which can be exploited for safety restoration.
  • The proposed curvature-aware alignment restoration method uses influence functions and second-order optimization to selectively increase the loss on harmful inputs without degrading overall task performance (a rough sketch of this kind of update follows this list).
  • Extensive evaluations demonstrate that the proposed method significantly reduces harmful responses while maintaining or even enhancing the model's utility and few-shot learning performance across multiple model families and settings.
  • By leveraging these preserved safety mechanisms, the framework re-aligns LLM outputs without full retraining, addressing the problem of safety degradation in fine-tuned models.
  • The method remains robust to adversarial attacks and parameter perturbations through improved safety-basin stability, supporting its practical viability in real-world deployments.
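
As a rough illustration of the second bullet above, the sketch below performs one curvature-preconditioned ascent step on a batch of harmful prompts, using Hessian-vector products and a few conjugate-gradient iterations instead of an explicit inverse Hessian. Everything here (`model`, `loss_fn_task`, `loss_fn_harm`, the hyperparameters) is a hypothetical placeholder; the paper's actual procedure, objective, and curvature approximation may differ.

```python
# Minimal sketch under stated assumptions; not the authors' implementation.
import torch


def flat_grad(loss, params, create_graph=False):
    """Concatenate d(loss)/d(params) into one flat vector."""
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])


def hvp(loss_fn_task, params, vec):
    """Hessian-vector product of the task loss via double backprop."""
    g = flat_grad(loss_fn_task(), params, create_graph=True)
    return flat_grad(g @ vec, params)


def conjugate_gradient(matvec, b, damping=0.1, iters=10):
    """Approximately solve (H + damping * I) x = b without forming H."""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p) + damping * p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def restoration_step(model, loss_fn_task, loss_fn_harm, lr=1e-3, damping=0.1):
    """One ascent step on the harmful loss, preconditioned by task-loss curvature."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_harm = flat_grad(loss_fn_harm(), params)  # direction that raises harmful loss
    step = conjugate_gradient(lambda v: hvp(loss_fn_task, params, v), g_harm, damping)
    offset = 0
    with torch.no_grad():
        for p in params:
            n = p.numel()
            p.add_(lr * step[offset:offset + n].view_as(p))  # ascend on harmful inputs
            offset += n
```

The damping term keeps each step small and local, echoing the abstract's goal of "precise, low-impact updates" rather than full reversion; the paper's use of influence functions to decide which harmful examples drive the update is likely more targeted than this generic sketch.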

💡 Why This Paper Matters

This paper addresses a significant gap in the safe deployment of large language models, presenting a method for restoring safety alignment that is lost during fine-tuning without compromising task-specific performance. Its findings have meaningful implications for making language models more resilient against harmful outputs while retaining high utility across diverse applications, a substantive contribution to AI safety and robustness.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is particularly relevant because it examines the relationship between fine-tuning practices and safety-alignment degradation. The proposed curvature-aware approach to safety restoration offers a new perspective on how to maintain safe behavior in models, which is critical in high-stakes environments where harmful outputs can have serious consequences. Moreover, the method's robustness against adversarial attacks underscores its potential utility in building more secure AI systems.
