ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

📄 Abstract

Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state-of-the-art defensive model of Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results could be found at https://github.com/leolee99/ReasAlign.

🔍 Key Points

Introduction of ReasAlign, a model-level defence mechanism designed to enhance safety alignment against prompt injection attacks in Large Language Models (LLMs).
Emphasis on structured reasoning processes to analyze user queries and detect conflicting instructions, ensuring continuity of user intent while defending against injected prompts.
Implementation of a test-time scaling mechanism using a preference-optimized judge model which selects the best reasoning trajectory, enhancing reasoning accuracy and efficiency.
Empirical results show that ReasAlign achieves a utility of 94.6% and an Attack Success Rate (ASR) of only 3.6% on the CyberSecEval2 benchmark, significantly outperforming prior protective models such as Meta SecAlign.
Validation of the the reasoning component through ablation studies, demonstrating its critical role in improving security with minimal impact on utility.

💡 Why This Paper Matters

This paper is significant as it presents a novel approach to security in LLMs through reasoning-enhanced safety alignment, addressing the growing risk of prompt injection attacks. By successfully achieving high utility while minimizing security vulnerabilities, ReasAlign sets a new standard for safe AI-driven systems in various applications. Its methodological innovations and practical implications serve as a key advancement for the fields of AI safety and security.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are crucial for AI security researchers because prompt injection attacks pose a significant threat to the integrity of LLMs. The novel techniques introduced, such as structured reasoning and test-time scaling, provide a robust framework for addressing these vulnerabilities, offering valuable insights for developing proactive defense mechanisms in AI systems.

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper