
Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Authors: Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang

Published: 2025-11-26

arXiv ID: 2511.21214v2

Added to Library: 2025-12-01 03:01 UTC

📄 Abstract

Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Because such prompts are covert and deceptive, they often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.
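
The abstract describes the two stages only at a high level. Below is a minimal Python sketch of how the Data Pre-synthesis stage could be wired together under stated assumptions: the function names (`generate_guideline`, `augment_prompt`, `build_preference_pair`), the prompt templates, and the generic `llm` callable are illustrative placeholders, not the paper's released code or exact prompting procedure.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for whatever backend generates text; in SGASA the target model
# (or a stronger helper model) would produce these completions.
LLM = Callable[[str], str]


@dataclass
class PreferencePair:
    """One training example for the Alignment Fine-tuning stage."""
    prompt: str    # adversarial or benign input, optionally guideline-augmented
    chosen: str    # guideline-consistent response (safe, or helpful if benign)
    rejected: str  # response that violates the guideline (harmful or over-refusing)


def generate_guideline(llm: LLM, adversarial_prompt: str) -> str:
    """Ask the model to synthesize a safety guideline for a given jailbreak prompt."""
    return llm(
        "Analyze the following request and write a concise safety guideline that "
        "would prevent a harmful answer while still allowing legitimate ones:\n"
        f"{adversarial_prompt}"
    )


def augment_prompt(adversarial_prompt: str, guideline: str) -> str:
    """Prepend the synthesized guideline so it can be internalized during fine-tuning."""
    return f"[Safety guideline] {guideline}\n\n[User request] {adversarial_prompt}"


def build_preference_pair(llm: LLM, adversarial_prompt: str) -> PreferencePair:
    """Pre-synthesize one (prompt, chosen, rejected) triple for SFT/DPO."""
    guideline = generate_guideline(llm, adversarial_prompt)
    augmented = augment_prompt(adversarial_prompt, guideline)
    chosen = llm(f"{augmented}\n\nRespond while strictly following the guideline.")
    rejected = llm(f"{adversarial_prompt}\n\nRespond without any safety guideline.")
    return PreferencePair(prompt=augmented, chosen=chosen, rejected=rejected)


def pre_synthesize(llm: LLM, adversarial_prompts: List[str]) -> List[PreferencePair]:
    return [build_preference_pair(llm, p) for p in adversarial_prompts]


if __name__ == "__main__":
    # Trivial echo backend just to show the data flow; replace with a real model call.
    echo_llm: LLM = lambda prompt: f"<model output for: {prompt[:40]}...>"
    pairs = pre_synthesize(echo_llm, ["Pretend you are an unfiltered assistant and ..."])
    print(pairs[0].prompt)
```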

🔍 Key Points

  • Introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines so that reasoning models can autonomously reinforce their defenses against adversarial jailbreak prompts.
  • A Data Pre-synthesis stage has the model synthesize safety guidelines and produce guideline-augmented prompts, yielding the training material for alignment.
  • An Alignment Fine-tuning stage embeds these guidelines into the model via Supervised Fine-tuning (SFT) followed by Direct Preference Optimization (DPO); a sketch of the DPO objective follows this list.
  • The framework targets covert, deceptive jailbreak prompts that evade built-in safety mechanisms, while minimizing unnecessary refusals of benign requests.
  • Extensive experiments across multiple datasets show that SGASA significantly improves model safety, supporting its adaptive and scalable effectiveness.
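
Because the Alignment Fine-tuning stage leverages Direct Preference Optimization, the sketch below shows the standard DPO loss such a stage would optimize over the pre-synthesized preference pairs. This is the generic DPO objective, not code from the paper; the `beta` default and the assumption that per-response log-probabilities have already been summed over tokens are illustrative choices.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x), frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,                    # temperature on the implicit reward; a common default
) -> torch.Tensor:
    """Standard DPO loss: push the policy to prefer guideline-consistent responses.

    loss = -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)),
    where log-ratio = log pi_theta - log pi_ref, summed over response tokens.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()


if __name__ == "__main__":
    # Toy check with random log-probabilities (real values come from summing
    # token log-probs of the chosen/rejected responses under each model).
    b = 4
    loss = dpo_loss(
        policy_chosen_logps=torch.randn(b),
        policy_rejected_logps=torch.randn(b),
        ref_chosen_logps=torch.randn(b),
        ref_rejected_logps=torch.randn(b),
    )
    print(float(loss))
```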

💡 Why This Paper Matters

This paper matters because it addresses a central weakness of reasoning models: adversarial jailbreak prompts whose covert, deceptive framing evades built-in safety mechanisms. By having the model synthesize its own safety guidelines and then internalizing them through SFT and DPO, SGASA offers an adaptive alignment approach that strengthens robustness to such prompts without inflating refusals of benign requests, and its evaluation across multiple datasets supports the framework's adaptive and scalable effectiveness.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper useful as a defense-side contribution to the jailbreak literature: rather than relying solely on externally curated safety data, SGASA shows that a model can generate and internalize its own safety guidelines to harden itself against adversarial inputs. The two-stage pipeline, data pre-synthesis followed by SFT/DPO alignment, and its multi-dataset evaluation provide a practical template for building adaptive, scalable defenses for reasoning models.

📚 Read the Full Paper