
Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Authors: Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son

Published: 2025-09-26

arXiv ID: 2509.22745v1

Added to Library: 2025-09-30 04:06 UTC

Safety

📄 Abstract

Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed to safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.
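
For concreteness, the sketch below illustrates the kind of routing-drift penalty the abstract describes: a per-layer divergence between the routing distribution of the fine-tuned model and that of the frozen safety-aligned reference, added to the task loss. This is a minimal illustration under assumptions, not the authors' implementation; the tensor shapes, the choice of KL divergence and its direction, and the weight `lam` are all assumptions.

```python
import torch
import torch.nn.functional as F

def routing_drift_penalty(router_logits_ft, router_logits_ref):
    """Divergence between the fine-tuned model's per-token routing distribution
    and the frozen safety-aligned reference model's (shapes: tokens x experts)."""
    log_p_ft = F.log_softmax(router_logits_ft, dim=-1)
    p_ref = F.softmax(router_logits_ref, dim=-1)
    # KL(p_ref || p_ft): penalizes routing that drifts away from the aligned reference.
    return F.kl_div(log_p_ft, p_ref, reduction="batchmean")

def regularized_loss(task_loss, router_logits_ft, router_logits_ref, lam=0.1):
    """Task loss plus routing-drift regularization, summed over MoE layers.
    `lam` is an illustrative weight, not a value from the paper."""
    reg = sum(
        routing_drift_penalty(ft, ref.detach())
        for ft, ref in zip(router_logits_ft, router_logits_ref)
    )
    return task_loss + lam * reg
```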

🔍 Key Points

  • The paper identifies a vulnerability in the safety mechanisms of Mixture-of-Experts (MoE) LLMs, where harmful inputs can inadvertently route away from safety-critical experts after fine-tuning, leading to increased risks of harmful outputs.
  • The proposed method, SafeMoE, is a fine-tuning approach built around routing-drift regularization: it minimizes the KL divergence between the routing weights of the fine-tuned model and those of the initial safety-aligned model, thereby keeping routing decisions aligned during fine-tuning.
  • Extensive experiments across a range of MoE LLMs show that SafeMoE significantly reduces harmfulness scores while only incurring minimal degradation in task utility, outperforming existing state-of-the-art defenses designed for monolithic LLMs.
  • The method is shown to be efficient, with only a 2% overhead in training time, making it practical for real-world applications of MoE LLMs.
  • SafeMoE incorporates bi-level greedy optimization that alternates between the routing regularization objective and task-specific fine-tuning, balancing safety and model performance (a rough sketch of such an alternating loop follows this list).
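
The sketch below shows one way such an alternating scheme could look: even steps minimize only the routing-drift regularizer against a frozen safety-aligned reference model, odd steps minimize the ordinary task loss. The alternation schedule, the hypothetical `model(**batch)` interface returning per-layer router logits, and the weight `lam` are assumptions; the paper's actual bi-level greedy optimization may differ in its details.

```python
import torch
import torch.nn.functional as F

def alternating_finetune(model, ref_model, loader, optimizer, lam=0.1):
    """Alternate a routing-regularization step with a task fine-tuning step.
    Hypothetical interface: model(**batch) -> (task_loss, router_logits),
    where router_logits is a list of (tokens x experts) tensors, one per MoE layer."""
    ref_model.eval()
    for step, batch in enumerate(loader):
        task_loss, router_ft = model(**batch)
        with torch.no_grad():
            _, router_ref = ref_model(**batch)

        if step % 2 == 0:
            # Regularization step: pull routing distributions back toward
            # those of the frozen safety-aligned reference model.
            loss = lam * sum(
                F.kl_div(F.log_softmax(ft, dim=-1),
                         F.softmax(ref, dim=-1),
                         reduction="batchmean")
                for ft, ref in zip(router_ft, router_ref)
            )
        else:
            # Task step: ordinary supervised fine-tuning objective.
            loss = task_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```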

💡 Why This Paper Matters

This paper is critical as it addresses a significant and previously underexplored vulnerability in MoE architectures, where routing decisions can degrade safety during fine-tuning. By proposing SafeMoE, the authors provide an effective defense that preserves the safety alignment of these models through fine-tuning while keeping task utility essentially intact, which is essential for deploying LLMs in safety-critical applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant for AI security researchers as it outlines an innovative approach to safeguarding large language models against harmful fine-tuning attacks. The identification of safety routing drift opens a new area of concern for model safety, and the techniques introduced in this work offer practical methods for hardening MoE LLMs against adversarial fine-tuning and unintended behavior.

📚 Read the Full Paper