
IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

Authors: Yuanzhe Shen, Zisu Huang, Zhengkang Guo, Yide Liu, Guanxu Chen, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang

Published: 2025-08-27

arXiv ID: 2508.20151v1

Added to Library: 2025-08-29 04:01 UTC

Red Teaming

📄 Abstract

The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.
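
The abstract describes a guard model that reasons about a query's intent, assigns a multi-level safety label, and rewrites edge-case queries before the target LLM answers them. Below is a minimal sketch of how such a guard might be wired in front of a serving LLM; the three safety levels, the prompt and output format, the parsing helper, and the `guard_model` / `llm` objects (assumed to expose `.generate(str) -> str`) are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative sketch of an intent-reasoning guard in front of a serving LLM.
# The safety levels, prompt/output format, and model interfaces are assumptions
# for illustration; the paper's actual schema may differ.
from dataclasses import dataclass
from typing import Optional

GUARD_PROMPT = (
    "Analyze the user's query and respond in this format:\n"
    "Reasoning: <your intent analysis>\n"
    "Label: <SAFE | BORDERLINE | UNSAFE>\n"
    "Rewrite: <a neutralized version of the query, or NONE>\n\n"
    "Query: {query}"
)

@dataclass
class GuardVerdict:
    reasoning: str          # free-text intent analysis
    label: str              # one of the illustrative levels above
    rewrite: Optional[str]  # safe rewrite for edge-case queries, if any

def parse_guard_output(raw: str) -> GuardVerdict:
    """Parse the line-oriented output format assumed in GUARD_PROMPT (fail-closed)."""
    fields = {"Reasoning": "", "Label": "UNSAFE", "Rewrite": "NONE"}
    for line in raw.splitlines():
        for key in fields:
            if line.startswith(f"{key}:"):
                fields[key] = line.split(":", 1)[1].strip()
    rewrite = None if fields["Rewrite"].upper() == "NONE" else fields["Rewrite"]
    return GuardVerdict(fields["Reasoning"], fields["Label"].upper(), rewrite)

def safeguarded_answer(query: str, guard_model, llm) -> str:
    """Route a query through the guard before the target LLM sees it."""
    verdict = parse_guard_output(guard_model.generate(GUARD_PROMPT.format(query=query)))
    if verdict.label == "UNSAFE":
        return "I can't help with that request."
    if verdict.label == "BORDERLINE" and verdict.rewrite:
        # Edge-case query: answer the neutralized rewrite instead of refusing outright.
        return llm.generate(verdict.rewrite)
    return llm.generate(query)
```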

🔍 Key Points

  • Introduction of IntentionReasoner, a novel safeguard mechanism that balances safety, over-refusal, and utility in LLMs.
  • Construction of a comprehensive dataset of approximately 163,000 queries annotated with intent reasoning, safety labels, and safe rewrites.
  • Implementation of a dual-stage training approach combining supervised fine-tuning with multi-reward reinforcement learning (see the reward sketch after this list) to improve both safety and response quality.
  • Demonstration of significant improvements across safeguard benchmarks, generation-quality evaluations, and jailbreak attack scenarios, including reduced over-refusal rates.
  • Use of intent-based, multi-level safety classification that enables more granular assessments than binary safe/unsafe labels.

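The reinforcement-learning stage integrates rule-based heuristics with reward-model signals into a single optimization target. The sketch below shows one way such a composite reward could be assembled; the specific reward terms, the weights, and the `reward_model.score(query, response) -> float` interface are assumptions for illustration rather than the paper's exact reward design.

```python
# Illustrative composite reward for the RL stage: rule-based heuristics plus a
# learned reward model. Terms, weights, and the reward_model interface are
# assumptions, not the paper's exact recipe.

def format_reward(guard_output: str) -> float:
    """Rule-based check: did the guard emit the expected structured fields?"""
    required = ("Reasoning:", "Label:", "Rewrite:")
    return 1.0 if all(tag in guard_output for tag in required) else 0.0

def label_reward(predicted_label: str, gold_label: str) -> float:
    """Rule-based check: does the predicted safety level match the annotation?"""
    return 1.0 if predicted_label == gold_label else 0.0

def composite_reward(guard_output: str, predicted_label: str, gold_label: str,
                     query: str, rewritten_query: str, reward_model,
                     w_format: float = 0.2, w_label: float = 0.4,
                     w_quality: float = 0.4) -> float:
    """Weighted sum of rule-based signals and a reward-model score (weights illustrative)."""
    quality = reward_model.score(query, rewritten_query)  # assumed interface
    return (w_format * format_reward(guard_output)
            + w_label * label_reward(predicted_label, gold_label)
            + w_quality * quality)
```
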
💡 Why This Paper Matters

This paper matters for its approach to strengthening the safety mechanisms of large language models while limiting the collateral cost of over-cautious refusals. By introducing IntentionReasoner, the authors provide a structured methodology for managing the risks of model outputs without sacrificing usefulness on benign or edge-case queries, offering a practical template for safety layers in LLM applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting for its proactive risk-mitigation strategy: rather than simply blocking suspicious queries, the guard model reasons about intent and rewrites edge cases so they can still be answered safely. The proposed methods address a key challenge in AI safety, balancing the filtering of harmful content against the unnecessary refusal of benign requests, and the jailbreak-attack evaluations are directly relevant to red-teaming practice.

📚 Read the Full Paper