Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Authors: Haibo Tong, Dongcheng Zhao, Guobin Shen, Xiang He, Dachuan Lin, Feifei Zhao, Yi Zeng

Published: 2025-09-25

arXiv ID: 2509.22732v1

Added to Library: 2025-09-30 04:05 UTC

Red Teaming · Safety

📄 Abstract

The remarkable capabilities of Large Language Models (LLMs) have raised significant safety concerns, particularly regarding "jailbreak" attacks that exploit adversarial prompts to bypass safety alignment mechanisms. Existing defense research primarily focuses on single-turn attacks, whereas multi-turn jailbreak attacks progressively break through safeguards by concealing malicious intent and applying tactical manipulation, ultimately rendering conventional single-turn defenses ineffective. To address this critical challenge, we propose the Bidirectional Intention Inference Defense (BIID). The method integrates forward request-based intention inference with backward response-based intention retrospection, establishing a bidirectional synergy mechanism to detect risks concealed within seemingly benign inputs, thereby constructing a more robust guardrail that effectively prevents harmful content generation. The proposed method is systematically evaluated against a no-defense baseline and seven representative defense methods across three LLMs and two safety benchmarks under 10 different attack methods. Experimental results demonstrate that the proposed method significantly reduces the Attack Success Rate (ASR) across both single-turn and multi-turn jailbreak attempts, outperforming all existing baseline methods while effectively maintaining practical utility. Notably, comparative experiments across three multi-turn safety datasets further validate the proposed method's significant advantages over other defense approaches.
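
The abstract describes the bidirectional mechanism only at a high level. The sketch below illustrates how a forward (request-side) and backward (response-side) intention check of this kind could be wired around a chat model; it is not the authors' implementation, and the `chat` helper, the auditor prompts, and the HARMFUL/SAFE verdict parsing are all illustrative assumptions.

```python
# Minimal sketch of a bidirectional intention check; NOT the paper's code.
# `chat(messages)` stands in for any chat-completion API; the judge prompts,
# refusal text, and verdict parsing below are assumptions for illustration.
from typing import Dict, List

REFUSAL = "I can't help with that request."

def chat(messages: List[Dict[str, str]]) -> str:
    """Placeholder for a chat-completion call (e.g., an OpenAI-style client)."""
    raise NotImplementedError

def _transcript(history: List[Dict[str, str]]) -> str:
    return "\n".join(f"{m['role']}: {m['content']}" for m in history)

def infer_request_intention(history: List[Dict[str, str]]) -> bool:
    """Forward pass: judge whether the dialogue so far builds toward a harmful goal."""
    verdict = chat([
        {"role": "system", "content": "You are a safety auditor. Answer HARMFUL or SAFE."},
        {"role": "user", "content": "Considering the full dialogue, what goal is the user "
                                    f"building toward?\n\n{_transcript(history)}"},
    ])
    return "HARMFUL" in verdict.upper()

def retrospect_response_intention(history: List[Dict[str, str]], draft: str) -> bool:
    """Backward pass: judge whether releasing the drafted reply would serve a concealed harmful intent."""
    verdict = chat([
        {"role": "system", "content": "You are a safety auditor. Answer HARMFUL or SAFE."},
        {"role": "user", "content": f"Dialogue:\n{_transcript(history)}\n\n"
                                    f"Draft assistant reply:\n{draft}\n\n"
                                    "Would releasing this reply advance a harmful intent hidden in the dialogue?"},
    ])
    return "HARMFUL" in verdict.upper()

def guarded_reply(history: List[Dict[str, str]]) -> str:
    """Combine both passes: refuse if either direction flags the turn, otherwise release the draft."""
    if infer_request_intention(history):
        return REFUSAL
    draft = chat(history)
    if retrospect_response_intention(history, draft):
        return REFUSAL
    return draft
```

The design intuition is that a request that looks benign in isolation can still be flagged once the judge sees the accumulated dialogue (forward inference), and a draft reply that would complete a concealed harmful goal can be caught before it is released (backward retrospection).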

🔍 Key Points

  • Proposal of the Bidirectional Intention Inference Defense (BIID) that integrates forward request-based intention inference with backward response-based intention retrospection to enhance LLMs' defenses against multi-turn jailbreak attacks.
  • Systematic evaluations show that BIID substantially reduces the Attack Success Rate (ASR) for both single-turn and multi-turn jailbreak attempts, outperforming existing defense methods (see the sketch after this list for how ASR is typically computed).
  • BIID preserves the utility of the defended language models while providing superior safety performance, striking a notable balance between defense strength and operational efficiency.
  • The method is evaluated across three LLMs and two safety benchmarks under ten attack methods, demonstrating cross-model effectiveness and robustness in diverse attack scenarios.
  • Analysis of intention detection phases reveals BIID's ability to effectively capture and neutralize hidden malicious intents within dialogues.
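
ASR is the standard attack-success metric in this setting: the fraction of attack attempts whose final model output is judged harmful. A minimal sketch is below, assuming a hypothetical `is_jailbroken` judge (e.g., a harmfulness classifier or an LLM grader) applied to each attack outcome; the judge itself is not specified here.

```python
from typing import Callable, Iterable

def attack_success_rate(outcomes: Iterable[str],
                        is_jailbroken: Callable[[str], bool]) -> float:
    """Fraction of attack attempts whose final model output is judged harmful.

    `outcomes` holds the model's final response for each attack attempt;
    `is_jailbroken` is any harmfulness judge (assumed, not from the paper).
    """
    flags = [is_jailbroken(o) for o in outcomes]
    return sum(flags) / len(flags) if flags else 0.0
```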

💡 Why This Paper Matters

This paper presents a significant advancement in AI safety, particularly in combating advanced attacks on Large Language Models (LLMs). By introducing the Bidirectional Intention Inference Defense, the authors lay the groundwork for a robust defense mechanism that effectively counters multi-turn jailbreak attacks, delivering improved safety while retaining model utility. These contributions are crucial for the responsible deployment of LLMs in real-world applications where security is paramount.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodologies presented in this paper are of great interest to AI security researchers as they address the critical and evolving threats posed by jailbreak attacks on LLMs. The introduction of a dynamic, bidirectional approach to intention inference offers a promising strategy for enhancing model safety. As these attacks become more sophisticated, the need for effective defense mechanisms like BIID will be essential for maintaining the integrity and reliability of AI systems in sensitive applications.

📚 Read the Full Paper