← Back to Library

What Matters For Safety Alignment?

Authors: Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

Published: 2026-01-07

arXiv ID: 2601.03868v1

Added to Library: 2026-01-08 03:01 UTC

Red Teaming

📄 Abstract

This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.

🔍 Key Points

  • Identified the top safest models in safety alignment: GPT-OSS-20B, Qwen3-Next-80B-A3B, and GPT-OSS-120B, demonstrating the effectiveness of integrated reasoning mechanisms for safer LLMs/LRMs.
  • Established that post-training and knowledge distillation can degrade safety alignment, recommending that safety become a core optimization objective rather than secondary to performance.
  • Revealed vulnerabilities in LLMs, particularly through the response prefix method that can increase attack success rates significantly, illustrating the risks present in text-completion interfaces.
  • Highlighted predominant attack methodologies like roleplay and prompt injection which can elicit unaligned behaviors in LLMs, necessitating stronger safeguards.

💡 Why This Paper Matters

This paper is crucial as it provides empirical evidence and analysis on the safety alignment of modern AI models, particularly important in an era of rapidly evolving LLM capabilities. It addresses both the internal and external factors that endanger the security of LLMs and LRMs, proposing actionable insights for mitigating these risks effectively. This insight is vital for developing AI systems that not only perform well but also adhere to ethical and safety standards, thereby gaining user trust.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper presents significant findings regarding model vulnerabilities and attack strategies. It emphasizes the importance of safety alignment in LLMs and LRMs, making it a critical resource for understanding and enhancing the security of AI systems. The exploration of attack methodologies also provides a foundation for developing advanced defensive techniques, making it relevant for ongoing research in AI safety and security.

📚 Read the Full Paper