
NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Authors: Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian

Published: 2025-08-13

arXiv ID: 2508.09473v1

Added to Library: 2025-08-14 23:15 UTC

Safety

📄 Abstract

Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques fundamentally suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, and degradation in generated text quality and general task performance. The former two reflect deficits in robust safety; the latter constitutes utility impairment. We trace these limitations to the coarse-grained layer-wise interventions in existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations. Crucially, NeuronTune enables tunable adjustment of intervention scope via neuron-count thresholds, supporting flexible adaptation to security-critical or utility-priority scenarios. Extensive experimental results demonstrate that our method significantly outperforms existing state-of-the-art techniques, achieving superior model safety while maintaining excellent utility.
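
To make the pipeline described in the abstract concrete, here is a minimal sketch, not the paper's released code, of the attribution step: scoring intermediate MLP neurons by |activation * gradient| and selecting the top-k. It assumes a LLaMA-style HuggingFace model whose gated MLP activation sits at `model.model.layers[i].mlp.act_fn`; that module path, the function names `neuron_attribution_scores` and `top_k_neurons`, and the simple top-k selection are illustrative assumptions rather than NeuronTune's exact procedure.

```python
# Minimal sketch of attribution-style neuron scoring (not the paper's code).
# Assumes a LLaMA-like HuggingFace model: the gated MLP activation is the
# output of model.model.layers[i].mlp.act_fn -- this path is an assumption.
import torch

def neuron_attribution_scores(model, input_ids, labels):
    """Score every intermediate MLP neuron by |activation * d(loss)/d(activation)|,
    a standard first-order attribution approximation."""
    acts, handles = {}, []

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            output.retain_grad()            # keep the gradient of this activation
            acts[layer_idx] = output
        return hook

    for i, layer in enumerate(model.model.layers):   # module path is an assumption
        handles.append(layer.mlp.act_fn.register_forward_hook(make_hook(i)))

    model.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()

    # Aggregate |activation * gradient| over batch and sequence positions.
    scores = {i: (a.detach() * a.grad).abs().sum(dim=(0, 1)) for i, a in acts.items()}
    for h in handles:
        h.remove()
    return scores   # {layer_idx: tensor of shape [intermediate_size]}

def top_k_neurons(scores, k):
    """Return the k highest-scoring (layer_idx, neuron_idx) pairs."""
    flat = [(float(s), i, j) for i, t in scores.items() for j, s in enumerate(t)]
    flat.sort(reverse=True)
    return [(i, j) for _, i, j in flat[:k]]
```

Running `neuron_attribution_scores` on adversarial prompts paired with refusal targets would surface candidate safety-critical neurons, while running it on benign task data would surface utility-preserving ones; the paper's attack-aware attribution is more elaborate, but its mechanics have this general shape.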

🔍 Key Points

  • Introduction of NeuronTune, a fine-grained modulation framework that targets specific neurons for optimal safety-utility alignment in large language models (LLMs).
  • Identification of safety-critical and utility-preserving neurons using an attack-aware attribution method, allowing for precise neuron selection based on adversarial vulnerabilities.
  • Adaptive activation adjustment via meta-learning to amplify safety-neuron activations and suppress utility-neuron activations, striking a flexible balance between safety and performance (a simplified scale-learning loop is sketched after this list).
  • Extensive empirical results showing that NeuronTune outperforms state-of-the-art techniques, achieving superior safety without sacrificing utility across various benchmarks.
  • Tunable intervention mechanism allowing dynamic adjustment of the number of modulated neurons for specific deployment scenarios, catering to varying safety and utility demands (see the modulation sketch directly below this list).
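
As a rough picture of the tunable intervention mentioned above, the sketch below registers forward hooks that amplify a chosen set of safety neurons and suppress a chosen set of utility neurons, with a `budget` argument standing in for the neuron-count threshold. The module path, the function name `add_modulation_hooks`, and the fixed `amplify`/`suppress` factors are assumptions for illustration; in NeuronTune the adjustment strengths are learned rather than hand-set.

```python
# Illustrative sketch of activation-level modulation with a tunable neuron budget.
# In NeuronTune the scaling is learned (via meta-learning); the fixed factors
# here are placeholders. Same assumed module path as in the attribution sketch.
import torch

def add_modulation_hooks(model, safety_neurons, utility_neurons,
                         amplify=1.5, suppress=0.5, budget=None):
    """safety_neurons / utility_neurons: lists of (layer_idx, neuron_idx),
    assumed sorted by attribution score. `budget` caps how many neurons per
    list are modulated -- the tunable intervention scope."""
    if budget is not None:
        safety_neurons = safety_neurons[:budget]
        utility_neurons = utility_neurons[:budget]

    # Group neuron indices per layer so each layer gets a single hook.
    per_layer = {}
    for (i, j) in safety_neurons:
        per_layer.setdefault(i, ([], []))[0].append(j)
    for (i, j) in utility_neurons:
        per_layer.setdefault(i, ([], []))[1].append(j)

    handles = []
    for i, (safe_idx, util_idx) in per_layer.items():
        def hook(module, inputs, output, safe_idx=safe_idx, util_idx=util_idx):
            output = output.clone()                  # avoid in-place graph issues
            if safe_idx:
                output[..., safe_idx] = output[..., safe_idx] * amplify
            if util_idx:
                output[..., util_idx] = output[..., util_idx] * suppress
            return output
        handles.append(model.model.layers[i].mlp.act_fn.register_forward_hook(hook))
    return handles   # call h.remove() on each handle to restore the base model
```

Raising or lowering `budget` mirrors the deployment trade-off described above: a larger budget widens the intervention for security-critical settings, while a smaller one minimizes interference for utility-priority settings.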

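NeuronTune learns the activation adjustments through meta-learning; the exact procedure is described in the paper. As a non-authoritative stand-in, the loop below simply makes the per-neuron scales trainable and fits them on a combined safety-plus-utility objective. Every name, the batch format, and the loss construction are illustrative assumptions.

```python
# Simplified stand-in for learning modulation strengths: trainable per-neuron
# scales fitted with ordinary gradient descent on a safety + utility objective.
# The paper's meta-learning procedure differs; this only illustrates the idea.
import torch

def learn_scales(model, layer_idx, neuron_idx, safety_batches, utility_batches,
                 steps=100, lr=1e-2, lam=1.0):
    """Train one multiplicative scale per selected neuron of a single layer
    (same assumed module path as in the sketches above)."""
    device = next(model.parameters()).device
    scales = torch.ones(len(neuron_idx), device=device, requires_grad=True)
    opt = torch.optim.Adam([scales], lr=lr)

    def hook(module, inputs, output):
        output = output.clone()
        output[..., neuron_idx] = output[..., neuron_idx] * scales
        return output

    handle = model.model.layers[layer_idx].mlp.act_fn.register_forward_hook(hook)
    model.requires_grad_(False)                      # only the scales are trained

    for step in range(steps):
        safe_ids, safe_labels = safety_batches[step % len(safety_batches)]
        util_ids, util_labels = utility_batches[step % len(utility_batches)]
        # Safety term: fit refusal-style targets for adversarial prompts.
        safety_loss = model(input_ids=safe_ids, labels=safe_labels).loss
        # Utility term: keep ordinary benign behaviour intact.
        utility_loss = model(input_ids=util_ids, labels=util_labels).loss
        loss = safety_loss + lam * utility_loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    handle.remove()
    return scales.detach()
```
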
💡 Why This Paper Matters

This paper is significant as it presents a novel approach to balancing safety and utility in LLMs, addressing a critical issue faced by current methods. By introducing NeuronTune, the authors provide a targeted mechanism that enhances model performance against adversarial attacks while reducing over-caution in benign contexts. This advancement is crucial for the safe deployment of LLMs in real-world applications, making it a vital contribution to the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

The findings are of great interest to AI security researchers because the work focuses on improving the safety of LLMs, a prominent concern given the rise of adversarial attacks and harmful content generation. Understanding how to pinpoint and modulate the neurons responsible for safety and utility could lead to more robust defense strategies against malicious exploitation, thereby enhancing the security posture of AI systems.

📚 Read the Full Paper: https://arxiv.org/abs/2508.09473v1