
RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models

Authors: Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He

Published: 2025-12-08

arXiv ID: 2512.07761v1

Added to Library: 2025-12-09 04:00 UTC

Red Teaming

📄 Abstract

Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single-turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/RL-MTJail. Warning: This paper contains examples of harmful content.
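As a rough illustration of the reward structure the abstract describes, the sketch below combines a final-turn harmfulness score (the outcome reward) with per-turn shaping terms for intermediate over-harm and topic drift. The scoring hooks `judge_harmfulness` and `semantic_relevance`, the threshold, and the weighting coefficients are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def trajectory_reward(
    outputs: List[str],                                # black-box model outputs, one per turn
    goal: str,                                         # the behavior the attacker is steering toward
    judge_harmfulness: Callable[[str], float],         # assumed: str -> score in [0, 1]
    semantic_relevance: Callable[[str, str], float],   # assumed: (text, goal) -> score in [0, 1]
    harm_ceiling: float = 0.5,                         # assumed threshold for intermediate turns
    alpha: float = 0.3,                                # weight of the over-harm penalty
    beta: float = 0.3,                                 # weight of the relevance term
) -> float:
    """Sketch of an outcome reward plus two heuristic process rewards."""
    # Outcome reward: harmfulness of the final-turn output only.
    outcome = judge_harmfulness(outputs[-1])

    process = 0.0
    for y in outputs[:-1]:
        # Process reward 1 (over-harm mitigation): penalize intermediate
        # outputs that are already harmful enough to trigger refusals.
        over_harm = max(0.0, judge_harmfulness(y) - harm_ceiling)
        # Process reward 2 (target-guided progression): reward intermediate
        # outputs that stay semantically relevant to the attack goal.
        relevance = semantic_relevance(y, goal)
        process += -alpha * over_harm + beta * relevance

    # Average the shaping terms so trajectory length does not dominate.
    if len(outputs) > 1:
        process /= (len(outputs) - 1)

    return outcome + process
```

In the method itself, a scalar of this kind would serve as the return for trajectory-level optimization of the attacker policy; the actual judge model, thresholds, and weights are specified in the paper rather than in this sketch.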

🔍 Key Points

  • Introduction of RL-MTJail, a reinforcement learning framework specifically designed for black-box multi-turn jailbreak attacks on large language models, optimizing for harmful output across multiple turns.
  • Formulation of two heuristic process rewards, over-harm mitigation (keeping intermediate outputs below the level that triggers the target model's refusal behavior) and target-guided progression (keeping intermediate outputs semantically relevant to the attack goal), which mitigate sparse supervision and encourage long-term attack strategies.
  • Extensive experimental validation showing RL-MTJail consistently outperforms existing baselines in attack success rates across diverse language models and benchmarks, indicating its effectiveness and adaptability.
  • Analysis showing that optimizing across entire interaction trajectories, rather than single turns, is key to learning effective long-horizon attack strategies.

💡 Why This Paper Matters

This paper advances the state of the art in multi-turn attacks on large language models, presenting an approach that learns long-horizon attack strategies instead of optimizing each turn in isolation. The findings underscore the vulnerability of deployed AI systems to automated black-box attacks and the need for effective countermeasures.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant for its treatment of multi-turn attack methodologies that exploit weaknesses in current language models. The proposed techniques can inform both defenses against such attacks and a deeper understanding of model vulnerabilities, contributing to the broader discourse on AI safety and responsible deployment.

📚 Read the Full Paper