← Back to Library

ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

Authors: Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Chaowei Xiao

Published: 2025-07-14

arXiv ID: 2507.11500v1

Added to Library: 2025-07-16 04:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have demonstrated remarkable generative capabilities. However, their susceptibility to misuse has raised significant safety concerns. While post-training safety alignment methods have been widely adopted, LLMs remain vulnerable to malicious instructions that can bypass safety constraints. Recent efforts have introduced inference-time safety reasoning (system-2 alignment), where LLMs conduct a reasoning process to perform safety verification before final response. We show, however, that these checks are driven by ad-hoc reasoning that diverges from the structured human process, where they first discern a user's true intent, then evaluate the associated risk based on the true intent. Consequently, these defenses remain vulnerable to sophisticated jailbreak prompts that cloak harmful goals in seemingly benign language. To build secure and safe LLMs, we propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chains of thought reasoning process with human-aligned, structured one. At inference, ARMOR (1) detects likely jailbreak strategies, (2) extracts the user's core intent while discarding deceptive instructions, and (3) applies a policy-grounded safety analysis to the purified request. ARMOR is evaluated on adaptive jailbreak attacks and multiple safety benchmarks, and a test-time scaling is conducted to further improve its performance. Results demonstrate that ARMOR significantly enhances the robustness against state-of-the-art adaptive jailbreak attacks and outperforms recent reasoning-based aligned models across various safety benchmarks.

🔍 Key Points

  • ARMOR proposes a structured reasoning-based safety alignment framework for large language models (LLMs) that enhances their robustness against adaptive jailbreak attacks.
  • The framework introduces meticulous reasoning steps that first identify a user's core intent and analyze potential jailbreak strategies, significantly improving safety evaluation.
  • Experimental results show that ARMOR drastically reduces the attack success rate from adaptive jailbreak techniques, such as AutoDAN-Turbo and Adversarial Reasoning, outperforming existing reasoning-based models across multiple safety benchmarks.
  • The model retains high general utility, demonstrating that safety enhancements do not compromise overall performance in generative capabilities.
  • ARMOR's test-time scaling capability allows it to dynamically adapt and defend against newly emerging jailbreak strategies through ongoing updates to its strategy library.

💡 Why This Paper Matters

This paper is highly relevant as it presents a pioneering approach to the critical issue of safety in large language models, providing a structured method to enhance reasoning about user intent and risk assessment. ARMOR's contributions lay the groundwork for future model safety advancements while maintaining high performance, making it an essential read for those invested in responsible AI deployment.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of interest as it tackles the urgent problem of LLM vulnerabilities, particularly against sophisticated jailbreak attacks. The proposed methods and insights on reasoning structures and threat detection mechanisms provide valuable strategies that can inform the development of more secure AI systems, contributing directly to the field of AI safety and security.

📚 Read the Full Paper