
NeST: Neuron Selective Tuning for LLM Safety

Authors: Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami, Ahmad-Reza Sadeghi

Published: 2026-02-18

arXiv ID: 2602.16835v1

Added to Library: 2026-02-20 03:03 UTC

Safety

πŸ“„ Abstract

Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains. We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model. NeST aligns parameter updates with the internal organization of safety behavior by clustering functionally coherent safety neurons and enforcing shared updates within each cluster, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead. We benchmark NeST against three dominant baselines (full fine-tuning, LoRA-based fine-tuning, and circuit breakers) across 10 open-weight LLMs spanning multiple model families and sizes. Across all evaluated models, NeST reduces the attack success rate from an average of 44.5% to 4.36%, corresponding to a 90.2% reduction in unsafe generations, while requiring only 0.44 million trainable parameters on average. This amounts to a 17,310x decrease in updated parameters compared to full fine-tuning and a 9.25x reduction relative to LoRA, while consistently achieving stronger safety alignment.
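The core idea of adapting only safety-relevant neurons can be illustrated with a minimal sketch. The snippet below is a hypothetical simplification, not the paper's implementation: it assumes safety neurons are scored by the gap in mean activation between unsafe and safe prompts (the paper's actual selection criterion may differ), and it freezes all other rows of a weight matrix during a gradient step. The function names `select_safety_neurons` and `masked_update` are illustrative only.

```python
import numpy as np

def select_safety_neurons(acts_unsafe, acts_safe, top_k):
    """Score each neuron by how differently it activates on unsafe
    vs. safe prompts; return the indices of the top_k neurons.

    acts_unsafe, acts_safe: (num_prompts, num_neurons) activation matrices.
    """
    score = np.abs(acts_unsafe.mean(axis=0) - acts_safe.mean(axis=0))
    return np.argsort(score)[-top_k:]

def masked_update(weights, grad, neuron_idx, lr=0.01):
    """Apply a gradient step only to the rows owned by the selected
    neurons; every other parameter stays frozen."""
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[neuron_idx] = True
    new_w = weights.copy()
    new_w[mask] -= lr * grad[mask]
    return new_w

# Toy example: neuron 2 fires much more strongly on unsafe prompts.
acts_unsafe = np.array([[0.0, 1.0, 5.0], [0.0, 1.0, 6.0]])
acts_safe = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
idx = select_safety_neurons(acts_unsafe, acts_safe, top_k=1)

weights = np.ones((3, 2))
grad = np.ones((3, 2))
updated = masked_update(weights, grad, idx, lr=0.1)
```

Because only the selected rows receive updates, the trainable-parameter count scales with the number of safety neurons rather than the model size, which is the source of the large parameter savings the abstract reports.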

πŸ” Key Points

  • Introduction of NeST, a neuron-structured framework for safety alignment that selectively fine-tunes a small subset of safety-relevant neurons while freezing the rest of the model, achieving robust safety improvements with dramatically fewer parameters.
  • Demonstration of NeST's effectiveness with empirical benchmarking against full fine-tuning, LoRA, and Circuit Breaker methods across diverse large language models, reducing attack success rates from 44.5% to 4.36% while minimizing parameter updates to an average of just 0.44 million.
  • Implementation of a unique clustering mechanism for safety neurons based on activation patterns, allowing for coordinated updates that preserve the functional coherence of safety behavior during adaptation.
  • Evaluation across diverse inference settings (text-only, image-based, and reasoning-augmented generation) shows NeST's stability and effectiveness, achieving consistent safety improvements in varied application environments.
  • Utility analysis confirms that NeST maintains model performance across various reasoning tasks, indicating that the adaptation process does not significantly degrade the core abilities of the models while enhancing safety alignment.
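The cluster-shared update mentioned in the key points above can be sketched as follows. This is a hypothetical illustration, assuming cluster assignments for the selected safety neurons are already available (the paper clusters by activation patterns; the clustering algorithm itself is not reproduced here). The function name `cluster_shared_grad` is an assumption for illustration.

```python
import numpy as np

def cluster_shared_grad(grad, labels):
    """Replace each safety neuron's gradient with the mean gradient of
    its cluster, so that functionally coherent neurons move together.

    grad:   (num_neurons, dim) per-neuron gradients.
    labels: (num_neurons,) cluster id for each neuron.
    """
    shared = np.empty_like(grad)
    for c in np.unique(labels):
        members = labels == c
        # Every member of cluster c receives the same averaged update.
        shared[members] = grad[members].mean(axis=0)
    return shared

# Toy example: neurons 0 and 1 form one cluster, neuron 2 another.
grads = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
shared = cluster_shared_grad(grads, labels)
```

Averaging within a cluster acts as a constraint that keeps co-functioning neurons consistent, which is one plausible reading of how shared updates could preserve the functional coherence of safety behavior during adaptation.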

πŸ’‘ Why This Paper Matters

This paper is highly relevant to AI safety and security because it presents a novel approach to safety alignment in large language models. By adapting only the neuron groups implicated in safety behavior, the NeST framework offers an efficient route to safer models without the heavy overhead typically associated with full fine-tuning. The work contributes significantly to ongoing efforts to ensure reliable AI deployment while maintaining performance, which is crucial as AI applications continue to proliferate.

🎯 Why It's Interesting for AI Security Researchers

The findings from this paper should be of great interest to AI security researchers because they address the critical challenge of safety alignment in large language models, a key concern for responsible AI use. NeST's selective fine-tuning of the neurons involved in safety behavior offers a new direction for building robust defenses against adversarial prompting that elicits harmful outputs. This research not only deepens our understanding of safety mechanisms in neural networks but also provides practical methods that real-world AI systems could adopt to improve safety without compromising functionality.

πŸ“š Read the Full Paper