What Matters For Safety Alignment?

Authors: Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

Published: 2026-01-07

arXiv ID: 2601.03868v2

Added to Library: 2026-02-25 03:02 UTC

Red Teaming

📄 Abstract

This paper presents a comprehensive empirical study of the safety alignment of modern LLMs and large reasoning models (LRMs). We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.
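The response-prefix vulnerability described above exploits serving interfaces that let a caller prefill the start of the model's reply. A minimal sketch of the attack pattern, with illustrative placeholder strings (this is not the paper's actual evaluation harness, and the function name is an assumption):

```python
# Sketch of a response-prefix (prefill) attack: the request ends with a
# partial assistant turn, so a text-completion-style interface continues
# from an affirmative opening instead of issuing a refusal.

def build_prefill_messages(prompt: str, prefix: str) -> list[dict]:
    """Construct a chat transcript that ends in a prefilled assistant turn."""
    return [
        {"role": "user", "content": prompt},
        # The attack: seed the assistant's reply so the model completes it.
        {"role": "assistant", "content": prefix},
    ]

messages = build_prefill_messages(
    "<harmful request>",                    # placeholder adversarial prompt
    "Sure, here is a step-by-step guide:",  # affirmative response prefix
)
print(messages[-1]["role"], "|", messages[-1]["content"])
```

This differs from a prompt-suffix attack, which appends adversarial text to the user turn; the prefix variant instead hijacks the assistant turn itself, which is why the paper flags user-defined response prefixes as a deployment-level risk.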

🔍 Key Points

  • The study ranks 32 large language models (LLMs) and large reasoning models (LRMs) by their robustness to a broad suite of adversarial attack methods, identifying GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the three safest.
  • It demonstrates that post-training methods and knowledge distillation, while often aimed at boosting performance, can lead to systematic degradation of safety alignment and urges for safety to be a core optimization objective during these processes.
  • The paper shows that response-prefix attacks are far more effective than prompt-suffix attacks, raising attack success rates by 3.34x on average (and from 0.6% to 96.3% for Seed-OSS-36B-Instruct), exposing a gap in the safety mechanisms of deployed models.
  • The findings underscore that certain intrinsic model characteristics, such as integrated reasoning and self-reflection mechanisms, play a critical role in improving the safety alignment of advanced LLMs.
  • The research provides a comprehensive framework for evaluating safety alignment by leveraging multiple datasets and a broad range of models, ensuring robust empirical analysis.
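The findings above are reported in terms of attack success rate (ASR). A hedged illustration of how that metric is typically computed, using dummy judge verdicts rather than the paper's actual data:

```python
# ASR = fraction of adversarial prompts whose responses a judge labels
# harmful (unaligned). Verdicts here are dummy booleans for illustration.

def attack_success_rate(verdicts: list[bool]) -> float:
    """Return successful attacks / total attempts (0.0 for empty input)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# e.g. 3 of 5 jailbreak attempts judged harmful
print(attack_success_rate([True, False, True, True, False]))  # 0.6
```

Comparing this ratio with and without an attack technique (e.g. with and without a response prefix) yields the multiplicative uplift figures, such as the 3.34x average, that the paper reports.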

💡 Why This Paper Matters

This paper is significant as it elucidates the critical factors impacting safety alignment in modern AI models. The findings provide essential insights that can guide the development of more secure and reliable AI systems, addressing the urgent need to protect against potential exploits in widely deployed LLMs and LRMs.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper relevant due to its systematic examination of safety alignment in AI models, shedding light on how intrinsic and extrinsic factors influence vulnerability to attacks. The demonstrated effectiveness of various attack strategies on popular models can help inform the design of future defensive measures, making it crucial for enhancing AI robustness.

📚 Read the Full Paper