Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

📄 Abstract

Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).

🔍 Key Points

Introduction of Any-Depth Alignment (ADA) as a novel method to enhance safety measures in Large Language Models (LLMs) by leveraging innate safety priors encoded in assistant header tokens.
ADA operates without altering the underlying model parameters, maintaining low inference costs while achieving near-100% refusal rates against sophisticated adversarial prefill attacks.
Empirical evidence shows that while traditional alignment methods collapse under deep prefill attacks, ADA effectively sustains safety at arbitrary generation depths, demonstrating robustness across a variety of model families.
The mechanism of reintroducing safety tokens mid-generation to trigger rethinking behavior showcases a new understanding of how LLMs encode internal safety signals, revealing insights into their operation and safety dynamics.
ADA's performance remains resilient even after subsequent fine-tuning, making it a suitable approach for real-time applications in AI systems requiring robust safety assurances.

💡 Why This Paper Matters

This paper is significant as it presents a transformative approach to safety alignment in LLMs, demonstrating that inherent safety mechanisms within these models can be unlocked and utilized effectively without altering their structure. The Any-Depth Alignment technique not only enhances model safety but also ensures operational efficiency, which is crucial for deploying AI systems in sensitive areas. The findings promote deeper understanding and practical advancements in AI safety frameworks, paving the way for safer implementations of LLMs.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it addresses critical vulnerabilities of LLMs in the context of adversarial prompts and harmful content generation. By proposing a method that maintains safety at any depth of text generation, it opens avenues for improving defenses against adversarial attacks—a pressing concern in AI deployment. The insights on extracting latent safety signals from LLMs also contribute to the development of more robust AI systems that can operate safely in real-world applications.

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper