
UpSafe°C: Upcycling for Controllable Safety in Large Language Models

Authors: Yuhao Sun, Zhuoer Xu, Shiwen Cui, Kun Yang, Lingyun Yu, Yongdong Zhang, Hongtao Xie

Published: 2025-10-02

arXiv ID: 2510.02194v1

Added to Library: 2025-10-03 04:03 UTC

📄 Abstract

Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe°C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe°C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
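
The abstract describes the upcycled layer only at a high level, so the following is a minimal PyTorch sketch of one plausible realization: the original MLP is kept as one expert, a newly added safety expert sits beside it, and a lightweight router produces soft gates that a safety temperature rescales at inference time. The class and argument names (UpcycledSafetyLayer, safety_temperature) are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledSafetyLayer(nn.Module):
    """Hypothetical sketch of a safety-upcycled transformer MLP block.

    The original MLP is retained as one expert and a newly added safety
    expert is trained on safety data. A lightweight router produces soft
    weights over the two experts (the "soft guardrail"), and a safety
    temperature rescales the router logits at inference time to shift the
    safety-utility trade-off.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Original MLP from the base model (kept as-is).
        self.original_mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Added safety expert, e.g. initialized from the original MLP.
        self.safety_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Router over {original, safety} experts.
        self.router = nn.Linear(d_model, 2)

    def forward(self, x: torch.Tensor, safety_temperature: float = 1.0) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        logits = self.router(x)                                  # (batch, seq, 2)
        # Lower temperature sharpens routing toward the preferred expert;
        # higher temperature softens the split between experts.
        gates = F.softmax(logits / safety_temperature, dim=-1)
        return (
            gates[..., 0:1] * self.original_mlp(x)
            + gates[..., 1:2] * self.safety_expert(x)
        )
```

In this sketch only the selected layers of the base model would be replaced by such blocks, keeping the rest of the network unchanged; the paper's actual routing and temperature parameterization may differ from this simple softmax form.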

🔍 Key Points

  • UpSafe°C identifies safety-critical layers in an LLM and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates the original MLPs and newly added safety experts.
  • A two-stage SFT strategy strengthens the model's safety discrimination while preserving its general capabilities.
  • A safety temperature mechanism allows the safety-utility trade-off to be adjusted dynamically at inference time (see the snippet after this list).
  • Experiments across multiple benchmarks, base models, and model scales show robust safety improvements against harmful and jailbreak inputs while maintaining competitive performance on general tasks.
  • Analysis shows that the safety temperature provides fine-grained inference-time control that traces the Pareto-optimal frontier between utility and safety.
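
As a toy illustration of the inference-time control described above, the snippet below sweeps a hypothetical softmax-style safety temperature over fixed router logits and prints how routing mass shifts between the original MLP and the safety expert; the logit values and the exact temperature parameterization are assumptions for illustration, not figures from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative router logits for one token: [original_mlp, safety_expert].
logits = torch.tensor([1.2, 0.4])

# Sweeping the temperature shows how routing mass can be shifted between
# the original MLP and the safety expert without any retraining.
for t in (0.5, 1.0, 2.0, 4.0):
    gates = F.softmax(logits / t, dim=-1)
    print(f"temperature={t}: original={gates[0].item():.2f}, safety={gates[1].item():.2f}")
```

Sweeping the temperature in this way is one simple way to trace a safety-utility curve and pick an operating point per deployment.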

💡 Why This Paper Matters

This paper reframes LLM safety as a modular, controllable property rather than a fixed outcome of post-training alignment. By upcycling safety-critical layers into a sparse MoE whose router acts as a soft guardrail, and by exposing a safety temperature at inference time, UpSafe°C lets practitioners tune the trade-off between safety and utility without retraining, pointing toward dynamic, inference-aware safety control.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting because it addresses the limitations of external guardrails, inference-time guidance, and post-training alignment within a single architecture-level framework. The router-as-guardrail design and the safety temperature mechanism offer a concrete path to defenses that remain robust against harmful and jailbreak inputs while preserving general utility, and they raise new evaluation questions around controllable, Pareto-optimal safety.
