
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Authors: Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu

Published: 2026-02-24

arXiv ID: 2602.21346v1

Added to Library: 2026-02-26 03:01 UTC

Red Teaming

📄 Abstract

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.

🔍 Key Points

  • The paper identifies the vulnerability of large language models (LLMs) to jailbreak attacks due to their reliance on shallow alignment mechanisms that lack deep reasoning.
  • It introduces a novel Chain-of-Thought (CoT) fine-tuning dataset that contains both utility-oriented and safety-critical prompts with step-by-step rationales, promoting principled model refusals based on reasoning.
  • The authors propose Alignment-Weighted Direct Preference Optimization (AW-DPO), a method that assigns different preference weights to the reasoning and final-answer segments of an output, yielding finer-grained, targeted updates and improved robustness against jailbreak attacks.
  • Extensive experiments demonstrate that AW-DPO consistently enhances alignment robustness while maintaining model utility across safety and utility benchmarks.
  • The study empirically confirms that existing alignment methods are superficial, suggesting a need for more sophisticated techniques that incorporate deep reasoning into LLM safety alignment.
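To make the weighting idea in the points above concrete, here is a minimal sketch of what a segment-weighted DPO loss could look like for a single preference pair. This is an illustrative reconstruction, not the paper's implementation: the per-token weight values, the function names, and the convention of applying weights inside the policy/reference log-ratio are all assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_logratio(policy_logps, ref_logps, weights):
    """Sum of per-token (policy - reference) log-prob differences,
    each scaled by a segment weight (e.g. one value for reasoning
    tokens, another for final-answer tokens)."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))

def aw_dpo_loss(pol_w, ref_w, wts_w, pol_l, ref_l, wts_l, beta=0.1):
    """Alignment-weighted DPO loss for one (chosen, rejected) pair.

    pol_*/ref_* are per-token log-probs under the policy and the frozen
    reference model; wts_* are per-token segment weights. With all
    weights equal to 1.0 this reduces to the vanilla DPO loss
    -log sigmoid(beta * (logratio_chosen - logratio_rejected)).
    """
    margin = beta * (weighted_logratio(pol_w, ref_w, wts_w)
                     - weighted_logratio(pol_l, ref_l, wts_l))
    return -math.log(sigmoid(margin))

# Toy example: two-token responses, where the first token stands in for
# the reasoning segment and the second for the final answer. Down-weighting
# the reasoning segment (0.5 here, an arbitrary illustrative value)
# concentrates the preference update on the final-answer tokens.
chosen_pol, chosen_ref = [-1.0, -1.0], [-1.2, -1.1]
rejected_pol, rejected_ref = [-0.5, -0.6], [-1.0, -1.0]
vanilla = aw_dpo_loss(chosen_pol, chosen_ref, [1.0, 1.0],
                      rejected_pol, rejected_ref, [1.0, 1.0])
weighted = aw_dpo_loss(chosen_pol, chosen_ref, [0.5, 1.0],
                       rejected_pol, rejected_ref, [0.5, 1.0])
```

The design point is simply that the weight vector gives the optimizer a knob per segment: assigning a larger weight to the segment where failures concentrate (per the paper, the most problematic parts of an output) makes its tokens dominate the preference margin, rather than letting every token contribute equally as in vanilla DPO.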

💡 Why This Paper Matters

This paper advances the safety alignment of large language models by addressing their vulnerability to adversarial prompts through a new reasoning-focused dataset and training method. By grounding refusals in explicit reasoning and introducing the Alignment-Weighted DPO approach, it charts a path toward more robust model training and evaluation in safety-critical applications. The findings matter for the responsible and secure deployment of AI systems in sensitive contexts.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper valuable as it tackles a pressing issue in the deployment of large language models, particularly their susceptibility to jailbreak attacks. The proposed methods, including a novel dataset and optimization approach, not only contribute to the understanding of safety alignment but also provide practical solutions that can be implemented to enhance the robustness of AI models. The research also opens avenues for future explorations in fine-tuning techniques and adversarial robustness, which are central themes in AI safety.

📚 Read the Full Paper