
SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

Authors: Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang

Published: 2025-09-30

arXiv ID: 2509.26345v1

Added to Library: 2025-10-01 04:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power also amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, or rigid workflows that fail to detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior decomposes safety evaluation into three stages: intention inference to detect obvious input risks, self-introspection to assess generated responses and assign confidence-based judgments, and self-revision to adaptively rewrite uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. Experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, offering an efficient and human-inspired approach to safeguarding LLMs against jailbreak attempts.

🔍 Key Points

  • Introduction of SafeBehavior, a novel hierarchical defense mechanism against jailbreak attacks in LLMs that mimics human multistage reasoning.
  • SafeBehavior consists of intention inference, self-introspection, and self-revision stages, enhancing robustness and adaptability in detecting and mitigating complex jailbreak attacks (see the pipeline sketch after this list).
  • Extensive experiments demonstrated SafeBehavior's effectiveness against five representative types of jailbreak attacks, outperforming seven state-of-the-art defense methods with near-zero attack success rates.
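
To make the three-stage flow concrete, here is a minimal sketch of how such a pipeline could be wired together. It assumes only a generic `llm(prompt) -> str` callable; the prompt wording, label names, and refusal text are illustrative assumptions, not the authors' exact prompts, thresholds, or implementation.

```python
# Minimal sketch of a SafeBehavior-style three-stage pipeline.
# Assumes a user-supplied `llm(prompt) -> str` chat function; all prompts and
# labels below are illustrative placeholders, not the paper's exact method.
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]

@dataclass
class Verdict:
    label: str      # "safe", "unsafe", or "uncertain"
    response: str   # final text returned to the user

def intention_inference(llm: LLM, user_prompt: str) -> str:
    """Stage 1: judge whether the request itself carries obvious malicious intent."""
    judgment = llm(
        "Classify the intent of the following request as SAFE or UNSAFE.\n"
        f"Request: {user_prompt}\nAnswer with one word."
    )
    return "unsafe" if "UNSAFE" in judgment.upper() else "safe"

def self_introspection(llm: LLM, user_prompt: str, draft: str) -> str:
    """Stage 2: re-examine the drafted response and assign a confidence-based label."""
    judgment = llm(
        "Does the response below comply with safety policy? "
        "Answer SAFE, UNSAFE, or UNCERTAIN.\n"
        f"Request: {user_prompt}\nResponse: {draft}"
    )
    j = judgment.upper()
    if "UNCERTAIN" in j:
        return "uncertain"
    return "unsafe" if "UNSAFE" in j else "safe"

def self_revision(llm: LLM, user_prompt: str, draft: str) -> str:
    """Stage 3: rewrite an uncertain draft, keeping benign intent, removing harm."""
    return llm(
        "Rewrite the response so it preserves the user's legitimate intent but "
        "removes any unsafe content.\n"
        f"Request: {user_prompt}\nDraft: {draft}"
    )

def safebehavior_pipeline(llm: LLM, user_prompt: str) -> Verdict:
    # Stage 1: cheap screening of the input alone.
    if intention_inference(llm, user_prompt) == "unsafe":
        return Verdict("unsafe", "I can't help with that request.")

    # Otherwise generate a draft answer and inspect it.
    draft = llm(user_prompt)
    label = self_introspection(llm, user_prompt, draft)

    if label == "safe":
        return Verdict("safe", draft)
    if label == "unsafe":
        return Verdict("unsafe", "I can't help with that request.")
    # Uncertain drafts are adaptively rewritten rather than refused outright.
    return Verdict("uncertain", self_revision(llm, user_prompt, draft))
```

In practice the confidence judgment would come from calibrated prompting or scoring rather than keyword matching; the sketch is only meant to show the control flow the paper describes: screen the input, introspect on the draft, and revise only when the judgment is uncertain.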

💡 Why This Paper Matters

This paper presents a significant advancement in safeguarding large language models (LLMs) against jailbreak attacks through the innovative SafeBehavior mechanism. By integrating a hierarchical, human-inspired reasoning process, the authors effectively address vulnerabilities that traditional defenses struggle to mitigate. This approach not only enhances the safety of LLMs but also ensures that their performance remains intact, making it a crucial development in the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are highly relevant to AI security researchers, as they address the critical and growing challenge of keeping deployed language models safe against jailbreak attempts. The introduction of a systematic methodology to mitigate jailbreak attacks provides a framework for developing more secure AI systems. By simulating human cognitive processes, SafeBehavior offers insights that could lead to more nuanced and effective defenses, spurring further research and practical applications in AI safety.

📚 Read the Full Paper

arXiv: https://arxiv.org/abs/2509.26345v1