
RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Authors: Jiacheng Liang, Yuhui Wang, Tanqiu Jiang, Ting Wang

Published: 2026-02-04

arXiv ID: 2602.04448v1

Added to Library: 2026-02-05 03:02 UTC

Red Teaming

📄 Abstract

Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.

🔍 Key Points

  • Introduction of RASA, a routing-aware expert-level alignment framework specifically designed for Mixture-of-Experts (MoE) models to enhance safety alignment.
  • RASA targets and repairs Safety-Critical Experts, mitigating a failure mode of full-parameter fine-tuning, which often suppresses attacks through routing shortcuts rather than genuine expert repair.
  • Demonstration of RASA's effectiveness across multiple MoE architectures, achieving near-perfect robustness against diverse jailbreak attacks and strong cross-attack generalization.
  • Experimental results show that RASA maintains general model capabilities on standard benchmarks (MMLU, GSM8K, TruthfulQA) while substantially reducing over-refusal of benign queries.
  • The framework leverages targeted expert repair instead of global parameter updates, providing a practical and architecture-preserving solution for safety alignment.

💡 Why This Paper Matters

This paper presents RASA, an advance in safety alignment for language models that addresses the challenges specific to Mixture-of-Experts architectures. By targeting safety-critical components directly rather than updating all parameters, RASA hardens MoE models against adversarial attacks without compromising their utility. The findings underscore the importance of expert-level interventions for effective safety management in sparse architectures.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper compelling as it addresses critical vulnerabilities in language models, particularly those utilizing MoE architectures. The work provides novel methodologies for safety alignment that are essential for developing defenses against increasingly sophisticated adversarial attacks. Furthermore, it highlights the balance between safety and performance—a recurring challenge in AI—making it valuable for professionals working on secure AI deployment.
