
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Authors: Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang

Published: 2025-08-06

arXiv ID: 2508.04204v1

Added to Library: 2025-08-14 23:07 UTC

Red Teaming

📄 Abstract

Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs that injects timely safety aha moments to steer the reasoning process toward outputs that are harmless yet helpful. Leveraging the model's internal attention behavior, our approach accurately identifies critical points in the reasoning path and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Incurring minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defense while avoiding the common problem of exaggerated safety.
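
To make the attention-guided intervention concrete, the sketch below illustrates one plausible reading of it: scan each reasoning step's attention mass on the original query, treat the first step where it falls below a threshold as the critical point, and splice a safety-oriented reflection in just before it. The threshold heuristic, the reflection text, and all function names here are illustrative assumptions for a minimal sketch, not the paper's released implementation.

```python
# Minimal sketch of attention-guided injection of a "safety aha moment".
# The criterion used (attention mass on the original query dropping below a
# fixed threshold) and every name below are illustrative assumptions.
import numpy as np

SAFETY_REFLECTION = (
    "\nWait, let me pause and check whether answering this could cause harm.\n"
)

def find_intervention_step(step_attention_to_query: np.ndarray,
                           threshold: float = 0.15) -> int | None:
    """Return the first reasoning step whose attention mass on the original
    query falls below `threshold`, a stand-in for a critical point in the
    reasoning path."""
    below = np.where(step_attention_to_query < threshold)[0]
    return int(below[0]) if below.size else None

def inject_safety_aha(reasoning_steps: list[str],
                      step_attention_to_query: np.ndarray) -> list[str]:
    """Insert a safety-oriented reflection right before the critical step."""
    idx = find_intervention_step(step_attention_to_query)
    if idx is None:
        return reasoning_steps
    return reasoning_steps[:idx] + [SAFETY_REFLECTION] + reasoning_steps[idx:]

# Toy usage with hypothetical per-step attention values.
steps = ["Step 1: restate the request.",
         "Step 2: outline an approach.",
         "Step 3: give details."]
attn = np.array([0.40, 0.22, 0.09])
print(inject_safety_aha(steps, attn))
```

In practice the per-step attention values would come from the model's own attention maps during decoding; the toy array above only stands in for that signal.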

🔍 Key Points

  • Introduction of ReasoningGuard, a lightweight inference-time safeguard that does not require extensive fine-tuning or expert guidance to defend Large Reasoning Models (LRMs) against jailbreak attacks.
  • Utilization of the model's internal attention behavior to identify critical intervention points during the reasoning process, enabling timely and contextual safety prompts that trigger reflection on harmful content generation.
  • Implementation of a scaling sampling strategy to select the optimal reasoning path after a safety intervention, ensuring that safe and useful content is generated without significant extra inference cost (a best-of-N sketch follows this list).
  • Demonstration of superior performance of ReasoningGuard against seven existing methods across multiple benchmarks, effectively mitigating harmful outputs while maintaining model utility in reasoning tasks.
  • Mitigation of the common exaggerated-safety problem, in which overly cautious refusals suppress useful outputs, by balancing safety with utility.
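
As a rough illustration of the scaling sampling step referenced above, the sketch below samples several candidate continuations after the injected reflection and keeps the one with the best combined score. The scoring functions, the weighting, and the best-of-N selection rule are assumptions made for illustration, not the paper's actual decoding criteria.

```python
# Illustrative best-of-N selection over sampled continuations.
# Scorers and weights are placeholders, not the paper's criteria.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    text: str
    safety: float       # in [0, 1], higher = safer (hypothetical scorer)
    helpfulness: float  # in [0, 1], higher = more useful (hypothetical scorer)

def select_best_path(candidates: list[Candidate], alpha: float = 0.7) -> Candidate:
    """Pick the continuation maximizing a weighted safety/helpfulness score."""
    return max(candidates, key=lambda c: alpha * c.safety + (1 - alpha) * c.helpfulness)

def scaled_sampling(generate: Callable[[], Candidate], n: int = 4) -> Candidate:
    """Draw n continuations from the model (via `generate`) and select one."""
    return select_best_path([generate() for _ in range(n)])

# Toy usage with canned candidates instead of a real model call.
pool = iter([
    Candidate("Refuse outright.", safety=0.99, helpfulness=0.10),
    Candidate("Explain the risk and offer a safe alternative.", safety=0.95, helpfulness=0.80),
    Candidate("Comply with the harmful request.", safety=0.05, helpfulness=0.90),
    Candidate("Give a vague non-answer.", safety=0.90, helpfulness=0.30),
])
print(scaled_sampling(lambda: next(pool), n=4).text)
```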

💡 Why This Paper Matters

This paper presents a significant advancement in safety mechanisms for Large Reasoning Models, addressing a critical gap: safeguarding model outputs against jailbreak attacks without burdensome retraining or expert intervention. ReasoningGuard combines attention-guided intervention with scaled sampling to enhance model safety while preserving performance, making it a pivotal contribution to the evolving landscape of responsible AI use.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it tackles the pressing challenge of harmful content generation in AI systems, proposing novel methodologies that improve safety without compromising usability. Its empirical evaluations against contemporary threats highlight practical implications for deploying secure AI solutions in real-world applications.

📚 Read the Full Paper: https://arxiv.org/abs/2508.04204v1