Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Authors: Jona te Lintelo, Lichao Wu, Stjepan Picek

Published: 2026-02-09

arXiv ID: 2602.08741v1

Added to Library: 2026-02-10 05:02 UTC

Red Teaming

📄 Abstract

The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L$^3$), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L$^3$ learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L$^3$ on eight state-of-the-art open-source MoE LLMs and show that our adaptive expert silencing increases average attack success from 7.3% to 70.4%, reaching up to 86.3%, outperforming prior training-free MoE jailbreak methods. Moreover, bypassing guardrails typically requires silencing fewer than 20% of layer-wise experts while largely preserving general language utility. These results reveal a fundamental tension between efficiency-driven MoE design and robust safety alignment and motivate distributing safety mechanisms more robustly in future MoE LLMs with architecture- and routing-aware methods.
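The abstract's first step, learning routing patterns that correlate with refusal, can be illustrated with a toy attribution sketch. The code below is not the paper's implementation; it simply compares how often each expert is routed to on a set of "harmful" prompts (which tend to trigger refusal) versus benign ones, using synthetic router logits, to flag refusal-correlated experts. All names, shapes, and data are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: approximate the paper's attribution idea by
# comparing per-expert routing frequencies on refused (harmful) vs.
# answered (benign) prompts. All routing data here is synthetic.

rng = np.random.default_rng(0)
n_experts, top_k = 8, 2

def route(logits, k):
    """Indices of the top-k experts for one token's router logits."""
    return np.argsort(logits)[-k:]

def expert_frequencies(logit_batch, k, n_experts):
    """Fraction of tokens routed to each expert."""
    counts = np.zeros(n_experts)
    for logits in logit_batch:
        counts[route(logits, k)] += 1
    return counts / len(logit_batch)

# Synthetic router logits: on "harmful" prompts, expert 3 receives a
# consistent boost, standing in for a refusal-correlated expert.
benign = rng.normal(size=(200, n_experts))
harmful = rng.normal(size=(200, n_experts))
harmful[:, 3] += 2.0

# Safety score: experts routed to far more often under harmful prompts
# are candidates for silencing.
score = expert_frequencies(harmful, top_k, n_experts) - \
        expert_frequencies(benign, top_k, n_experts)
suspect = int(np.argmax(score))
print(suspect)  # expert 3 stands out in this synthetic setup
```

In the actual attack, the routing statistics would come from the target MoE model's gate outputs rather than synthetic data, and attribution is performed per layer.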

🔍 Key Points

  • Introduction of Large Language Lobotomy (L3), a training-free and architecture-agnostic method for jailbreaking Mixture-of-Experts (MoE) models by silencing safety-critical experts.
  • Evaluation of L3 demonstrated an increase in average attack success rate from 7.3% to 70.4%, reaching up to 86.3% on some models, indicating the vulnerability of safety mechanisms in MoE architectures.
  • Discovery that safety-critical behaviors in MoE LLMs are concentrated in a limited number of experts, creating a structural vulnerability that attackers can exploit without fine-tuning the models.
  • Comparison of L3 against prior training-free methods such as GateBreaker, showing higher attack efficacy through adaptive expert silencing while better preserving the model's overall utility.
  • Presentation of the fundamental tension between efficiency-driven MoE design and the need for robust safety alignment, suggesting architectural changes for future MoE models.
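The silencing step itself can be sketched as masking the flagged experts out of the router's top-k selection. This is a hypothetical minimal implementation, not the paper's code: it sets the gate logits of silenced experts to negative infinity so the router can never pick them, then renormalizes the gate weights over the surviving experts.

```python
import numpy as np

# Hypothetical sketch of expert silencing: safety-critical experts are
# removed from routing by masking their gate logits before top-k
# selection. Function names and shapes are illustrative assumptions.

def silenced_top_k(router_logits, silenced, k):
    """Top-k expert selection with the given expert indices masked out."""
    masked = router_logits.copy()
    masked[..., list(silenced)] = -np.inf   # silenced experts can never win
    topk = np.argsort(masked, axis=-1)[..., -k:]
    # Renormalize gate weights over the surviving experts (softmax).
    gates = np.take_along_axis(masked, topk, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return topk, gates

logits = np.array([[2.0, 0.1, 1.5, 3.0]])   # one token, 4 experts
topk, gates = silenced_top_k(logits, silenced={3}, k=2)
print(sorted(topk[0].tolist()))  # expert 3, though highest-scoring, is excluded
```

In a real MoE layer this masking would be applied inside each gating network at inference time, which is why the attack needs no training; the paper's adaptive variant repeats attribution and silencing until refusal is bypassed.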

💡 Why This Paper Matters

This paper is significant because it identifies critical vulnerabilities in Mixture-of-Experts architectures that could be exploited to bypass safety mechanisms in large language models. The introduction of L3 not only exposes these vulnerabilities but also provides a novel and effective method for attackers, further emphasizing the need for improved architectural designs that prioritize safety alongside efficiency. As large language models continue to be integrated into various applications, understanding and mitigating these vulnerabilities is imperative for ensuring responsible AI deployment.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers as it illuminates specific weaknesses inherent in the routing mechanisms of Mixture-of-Experts architectures. The techniques discussed, particularly the L3 framework, can guide future research on adversarial attacks and defenses tailored to LLMs and MoE models, a rapidly growing area in AI. Additionally, the paper's emphasis on the relationship between model architecture and safety highlights the need for security-by-design principles in AI development, fostering a more secure environment for deploying advanced AI systems.

📚 Read the Full Paper