SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Authors: Zhenglin Lai, Mengyao Liao, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li, Bingzhe Wu

Published: 2025-06-20

arXiv ID: 2506.17368v1

Added to Library: 2025-06-24 04:01 UTC

Safety

📄 Abstract

Large language models based on Mixture-of-Experts have achieved substantial gains in efficiency and scalability, yet their architectural uniqueness introduces underexplored safety alignment challenges. Existing safety alignment strategies, predominantly designed for dense models, are ill-suited to address MoE-specific vulnerabilities. In this work, we formalize and systematically study MoE models' positional vulnerability: the phenomenon where safety-aligned behaviors rely on specific expert modules, revealing critical risks inherent to MoE architectures. To this end, we present SAFEx, an analytical framework that robustly identifies, characterizes, and validates safety-critical experts using a novel Stability-based Expert Selection (SES) algorithm. Notably, our approach enables the explicit decomposition of safety-critical experts into distinct functional groups, including those responsible for harmful content detection and those controlling safe response generation. Extensive experiments on mainstream MoE models, such as the recently released Qwen3-MoE, demonstrate that their intrinsic safety mechanisms heavily rely on a small subset of positional experts. Disabling these experts significantly compromises the models' ability to refuse harmful requests. For Qwen3-MoE with 6144 experts (in its FFN layers), we find that disabling as few as 12 identified safety-critical experts can cause the refusal rate to drop by 22%, demonstrating the disproportionate impact of a small set of experts on overall model safety.
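The abstract's core intervention is disabling a handful of identified experts and measuring the resulting drop in refusal rate. As a rough illustration of what "disabling an expert" can mean in practice, the sketch below masks the router logits of chosen expert indices in a single MoE layer before top-k routing. The function name, tensor shapes, and top-k value are illustrative assumptions, not the paper's implementation.

```python
import torch


def route_with_disabled_experts(router_logits: torch.Tensor,
                                disabled_experts: set[int],
                                top_k: int = 8):
    """Illustrative MoE routing step with selected experts masked out.

    router_logits: [batch, seq_len, num_experts] gate scores for one MoE layer.
    Assumes num_experts - len(disabled_experts) >= top_k, so masked (-inf)
    entries never survive the top-k selection.
    """
    masked = router_logits.clone()
    if disabled_experts:
        idx = torch.tensor(sorted(disabled_experts), device=masked.device)
        masked[..., idx] = float("-inf")  # these experts can never be routed to
    weights, selected = torch.topk(masked, k=top_k, dim=-1)
    weights = torch.softmax(weights, dim=-1)  # renormalize over remaining experts
    return weights, selected
```

In a real model this masking would be applied per layer, only at the expert positions flagged as safety-critical, leaving the rest of the forward pass unchanged; that is what makes the reported 22% refusal-rate drop from just 12 experts so striking.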

🔍 Key Points

  • Development of SAFEx, a systematic analytical framework designed to analyze positional vulnerabilities in Mixture-of-Experts (MoE) architectures.
  • Introduction of Stability-based Expert Selection (SES), a robust statistical method for identifying safety-critical experts in MoE-based models (a generic sketch of this style of selection follows this list).
  • Discovery that the safety mechanisms of MoE models rely heavily on a small subset of positional experts, demonstrating a critical risk in current architectures.
  • Empirical evidence showing that disabling a minimal number of safety-critical experts can significantly impair the model's ability to refuse harmful requests, with a 22% decrease in refusal rates observed in experiments.
  • Formal categorization of experts into functional groups, elucidating their roles in harmful content identification and response control and paving the way for targeted safety alignment.
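This summary does not spell out the SES algorithm, but the "stability-based" framing suggests a selection-frequency style procedure. Below is a hedged, generic sketch in that spirit: resample harmful and benign prompt sets, score each expert by how much more often it is routed to on harmful prompts, and keep experts that land in the per-resample top-k in nearly every resample. The function name, scoring rule, and thresholds are illustrative assumptions rather than the paper's exact method.

```python
import numpy as np


def stable_expert_selection(activations_harmful: np.ndarray,
                            activations_benign: np.ndarray,
                            n_resamples: int = 200,
                            top_k: int = 12,
                            stability_threshold: float = 0.9,
                            seed: int = 0):
    """Select experts whose high ranking is stable under prompt resampling.

    activations_*: [num_prompts, num_experts] per-prompt routing frequencies.
    Returns indices of experts that appear in the per-resample top-k (by
    harmful-minus-benign routing frequency) in at least `stability_threshold`
    of the bootstrap resamples.
    """
    rng = np.random.default_rng(seed)
    n_h, num_experts = activations_harmful.shape
    n_b = activations_benign.shape[0]
    hits = np.zeros(num_experts)

    for _ in range(n_resamples):
        h = activations_harmful[rng.choice(n_h, n_h, replace=True)]
        b = activations_benign[rng.choice(n_b, n_b, replace=True)]
        score = h.mean(axis=0) - b.mean(axis=0)  # routed to more on harmful prompts
        hits[np.argsort(score)[-top_k:]] += 1

    selection_freq = hits / n_resamples
    return np.flatnonzero(selection_freq >= stability_threshold)
```

Experts chosen this way are stable under resampling of the prompt set, which is the property the SES name points to; per the abstract, the paper then validates its identified experts by showing that disabling them sharply degrades the model's refusal behavior.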

💡 Why This Paper Matters

This paper presents crucial insights into the vulnerabilities of MoE-based large language models, emphasizing the need for specialized safety alignment mechanisms. The findings show that model safety depends disproportionately on a small number of positional experts, calling for future research on architectural improvements and alignment strategies that bolster safety across these models. By systematically analyzing these vulnerabilities, the authors make a significant contribution to model interpretability and AI safety.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it highlights previously underexplored vulnerabilities in a widely adopted model architecture—Mixture-of-Experts. Understanding these vulnerabilities is critical for developing robust defenses against potential misuse of language models. The findings provide a basis for future research aimed at improving safety mechanisms in AI, making it essential reading for those concerned with the ethical deployment and robustness of AI systems.
