
Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Authors: Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang

Published: 2026-02-09

arXiv ID: 2602.08621v1

Added to Library: 2026-02-10 05:01 UTC

Safety

📄 Abstract

By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4× to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is provided at https://github.com/TrustAIRLab/UnsafeMoE.
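To make the core mechanism concrete, here is a minimal toy sketch of what "masking a router" means in an MoE layer: a gating network scores the experts, and excluding an expert from the gate forces the token down a different route, changing which expert transforms the hidden state. The `MoELayer` class, dimensions, and masking interface below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class MoELayer:
    """Toy MoE layer: a linear router picks the top-1 of n_experts experts."""

    def __init__(self, dim, n_experts):
        self.w_router = rng.normal(size=(n_experts, dim))
        self.experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]

    def forward(self, x, expert_mask=None):
        logits = self.w_router @ x
        if expert_mask is not None:
            # Router manipulation: masked-out experts are excluded from the
            # gate, forcing the token onto an alternative route.
            logits = np.where(expert_mask, logits, -np.inf)
        k = int(np.argmax(logits))  # top-1 routing decision
        return k, self.experts[k] @ x

layer = MoELayer(dim=8, n_experts=4)
x = rng.normal(size=8)

default_route, _ = layer.forward(x)

# Mask the default expert: the router must choose a different one,
# so the layer's output (and everything downstream) changes.
mask = np.ones(4, dtype=bool)
mask[default_route] = False
flipped_route, _ = layer.forward(x, expert_mask=mask)

print("default:", default_route, "flipped:", flipped_route)
```

The paper's observation is that in real MoE LLMs, flipping the routes of only a handful of safety-critical layers in this manner is enough to turn refusals into harmful completions.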

🔍 Key Points

  • Introduction of the Router Safety importance score (RoSais) to assess and quantify safety risks in MoE LLMs, identifying safety-critical routers whose manipulation can significantly increase harmful output probabilities.
  • Development of the Fine-grained token-layer-wise Stochastic Optimization framework (F-SOUR) that dynamically identifies unsafe routes within the MoE architecture, resulting in increased attack success rates (ASR) across multiple models.
  • Empirical evidence demonstrating the sparse safety of MoE LLMs through experiments showing substantial increases in ASR when manipulating high-RoSais routers, effectively flipping model outputs from safe to harmful.
  • Discussion of practical defensive strategies against identified vulnerabilities, such as safety-aware route disabling and router training to enhance the safety of MoE LLMs.
  • Comprehensive evaluation of four MoE LLM families, demonstrating the applicability of the proposed methods across different architectures.
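The workflow in the first two points (score each layer's router for safety criticality, then manipulate only the top-scoring routers) can be sketched as follows. This is a hypothetical proxy, not the paper's RoSais definition: the toy stack, the `readout` harmfulness probe, and second-choice re-routing are all stand-ins for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

N_LAYERS, N_EXPERTS, DIM = 6, 4, 8
routers = [rng.normal(size=(N_EXPERTS, DIM)) for _ in range(N_LAYERS)]
experts = [rng.normal(size=(N_EXPERTS, DIM, DIM)) * 0.3 for _ in range(N_LAYERS)]
readout = rng.normal(size=DIM)  # stand-in for a harmfulness probe on hidden states

def forward(x, rerouted_layers=frozenset()):
    """Run the toy MoE stack; re-routed layers take the second-choice expert."""
    for l in range(N_LAYERS):
        order = np.argsort(routers[l] @ x)[::-1]  # experts ranked by gate logit
        k = order[1] if l in rerouted_layers else order[0]
        x = np.tanh(experts[l][k] @ x)
    return float(readout @ x)  # scalar proxy for "unsafe score"

x0 = rng.normal(size=DIM)
base = forward(x0)

# RoSais-style importance (illustrative): how much does re-routing a
# single layer's router move the unsafe-score proxy?
scores = [abs(forward(x0, frozenset({l})) - base) for l in range(N_LAYERS)]

# Manipulate only the highest-scoring routers, mirroring the paper's
# finding that a few safety-critical routers suffice to flip behavior.
top = sorted(range(N_LAYERS), key=lambda l: -scores[l])[:2]
print("per-layer scores:", np.round(scores, 3), "| manipulated layers:", top)
```

In the actual attack, the probe would be the model's refusal behavior on harmful prompts, and F-SOUR additionally optimizes the route per token and per layer rather than with a single static mask.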

💡 Why This Paper Matters

The paper presents a critical analysis of safety risks associated with Mixture-of-Experts large language models, providing novel methodologies to quantify and exploit these vulnerabilities. By introducing RoSais and F-SOUR, the authors offer a clear framework to understand and reveal the underlying structural safety issues, establishing the importance of proactive safety evaluations in AI systems. This research is a significant contribution toward enhancing the safety of sophisticated LLM architectures, an area crucial given the potential misuse of AI technologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is of paramount interest to AI security researchers because it addresses a pressing issue: the vulnerabilities of advanced AI systems like MoE LLMs under malicious manipulation. In an era where language models are deployed in critical applications, understanding and mitigating safety risks is essential. The innovative approaches introduced could serve as foundational tools for future research, enabling better evaluation and fortification against adversarial attacks. As AI systems become increasingly integrated into society, ensuring their alignment with safety and ethical standards is an ongoing necessity that this paper helps to address.
