
Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Authors: Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng

Published: 2025-08-08

arXiv ID: 2508.09190v1

Added to Library: 2025-08-14 23:15 UTC

Safety

📄 Abstract

Fine-tuning as a service injects domain-specific knowledge into large language models (LLMs), but it also challenges the original alignment mechanisms and introduces safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, yet most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.
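As a rough, hedged illustration of the projection idea described in the abstract (not the paper's exact update rule), the sketch below restores the component of a safety neuron's weights that lies along an assumed precomputed safety direction to its pre-fine-tuning value, while leaving the orthogonal (task-related) changes untouched. All names and shapes here (w_aligned, safety_dir, the 4096-dimensional vectors) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def project(v: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Orthogonal projection of vector v onto a (not necessarily unit) direction."""
    d = direction / (np.linalg.norm(direction) + 1e-12)  # normalize the direction
    return np.dot(v, d) * d  # scalar component along d, times the unit vector

def repair_safety_neuron(w_finetuned: np.ndarray,
                         w_aligned: np.ndarray,
                         safety_dir: np.ndarray) -> np.ndarray:
    """Restore the safety-direction component of a neuron's weights to its aligned value.

    Illustrative reading of 'projecting safety neuron parameters onto safety
    directions'; this is a sketch, not the paper's actual procedure.
    """
    return w_finetuned - project(w_finetuned, safety_dir) + project(w_aligned, safety_dir)

# Hypothetical usage with stand-in vectors (dimensions and values are arbitrary)
rng = np.random.default_rng(0)
w_aligned = rng.normal(size=4096)                        # safety neuron weights before fine-tuning (assumed)
w_finetuned = w_aligned + 0.01 * rng.normal(size=4096)   # weights after (possibly unsafe) fine-tuning
safety_dir = rng.normal(size=4096)                       # assumed safety direction for this neuron
w_repaired = repair_safety_neuron(w_finetuned, w_aligned, safety_dir)
```

Because an update of this form only alters the single component along the safety direction for the localized neurons, it is consistent with the abstract's claim of minimal parameter modification while preserving downstream-task changes.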

🔍 Key Points

  • Introduction of Fine-Grained Safety Neurons (FGSN), enabling precise identification of safety neurons while minimizing interference with general tasks in LLMs (a hedged localization sketch follows this list).
  • Proposes a Training-Free Continual Projection method that allows for ongoing adaptation to emerging safety dimensions without extensive parameter changes.
  • Demonstrated through experiments that FGSN significantly lowers harmfulness scores and attack success rates across multiple LLMs while maintaining model utility.
  • Highlights the importance of multi-scale interactions between safety neurons and safety layers, establishing a framework for effective safety alignment in LLMs.
  • Achieves continual safety improvements without additional training, showing that safety enhancements can be applied efficiently.
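To make the localization idea in the first bullet more concrete, here is a minimal sketch that scores the neurons of one layer by the gap between their mean activations on harmful versus benign prompts and keeps the top-k. The scoring rule, array shapes, and top-k cutoff are assumptions for illustration only; the paper's actual criterion combines multi-scale information across safety layers and neurons.

```python
import numpy as np

def localize_safety_neurons(acts_harmful: np.ndarray,
                            acts_benign: np.ndarray,
                            top_k: int = 64) -> np.ndarray:
    """Rank neurons by the absolute gap between mean activations on harmful vs. benign prompts.

    acts_*: arrays of shape (num_prompts, num_neurons) collected from a single layer.
    Returns the indices of the top_k neurons with the largest gap.
    This scoring rule is an illustrative assumption, not the paper's exact criterion.
    """
    gap = np.abs(acts_harmful.mean(axis=0) - acts_benign.mean(axis=0))
    return np.argsort(gap)[::-1][:top_k]

# Hypothetical usage with random stand-in activations
rng = np.random.default_rng(0)
acts_harmful = rng.normal(size=(128, 4096))
acts_benign = rng.normal(size=(128, 4096))
safety_neuron_ids = localize_safety_neurons(acts_harmful, acts_benign, top_k=32)
```

The selected indices would then be the candidate safety neurons whose parameters are adjusted (e.g., via the projection sketched after the abstract), keeping the edit sparse relative to the full model.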

💡 Why This Paper Matters

This paper presents a groundbreaking approach to enhancing the safety of large language models (LLMs) by introducing Fine-Grained Safety Neurons and a Training-Free Continual Projection method. The significant reduction in harmfulness scores, alongside preserved model utility, illustrates its potential impact on deploying LLMs for real-world applications where safety is paramount. By effectively balancing safety with functionality, this research contributes to the robust alignment of AI systems with ethical considerations.

🎯 Why It's Interesting for AI Security Researchers

This paper will interest AI security researchers as it addresses the growing safety concerns associated with fine-tuning large language models. The novel approach to continual safety improvement shows promise not only in addressing current safety issues but also in adapting to emerging risks. The methodology and findings can inform ongoing efforts in AI safety, guiding the development of more resilient models and better safeguarding practices.

📚 Read the Full Paper