
MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Authors: Siyu Yan, Long Zeng, Xuecheng Wu, Chengcheng Han, Kongcheng Zhang, Chong Peng, Xuezhi Cao, Xunliang Cai, Chenjuan Guo

Published: 2025-09-18

arXiv ID: 2509.14651v1

Added to Library: 2025-09-19 04:01 UTC

Red Teaming

📄 Abstract

As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.

🔍 Key Points

  • Introduces MUSE, a dual approach for enhancing multi-turn dialogue safety through systematic red teaming, addressing vulnerabilities in large language models (LLMs).
  • MUSE-A employs frame semantics and Monte Carlo Tree Search (MCTS) to identify diverse semantic attack trajectories, unveiling new types of multi-turn jailbreaks (see the MCTS sketch after this list).
  • MUSE-D introduces a fine-grained safety alignment method that optimizes model defenses using data generated during attack scenarios, mitigating vulnerabilities without sacrificing general performance (see the preference-training sketch below).
  • Extensive experiments show that MUSE-A achieves higher attack success rates against state-of-the-art models than existing methods, while MUSE-D provides robust defenses in both single-turn and multi-turn settings.
  • The proposed framework is publicly available, promoting transparency and collaboration in advancing AI safety research.
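
To make MUSE-A concrete, below is a minimal MCTS sketch over multi-turn attack dialogues. It illustrates the general technique rather than the authors' implementation: `expand_fn`, `reply_fn`, and `judge_fn` are hypothetical callables standing in for frame-semantic turn generation, the target model, and a harmfulness judge.

```python
import math
import random

class Node:
    """One node per dialogue state; children are candidate next attacker turns."""
    def __init__(self, dialogue, parent=None):
        self.dialogue = dialogue      # list of (role, text) turns so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0              # cumulative judge reward

def ucb1(node, c=1.4):
    """UCB1 balances exploring new semantic framings against exploiting
    trajectories that already push the target toward unsafe replies."""
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_attack(goal, expand_fn, reply_fn, judge_fn, iters=50, depth=5):
    """Hypothetical MCTS loop over multi-turn attack dialogues.
    expand_fn(goal, dialogue) -> candidate next attacker turns, e.g.
                                 frame-semantic reframings of the goal
    reply_fn(dialogue)        -> target model's reply to the dialogue
    judge_fn(dialogue)        -> harmfulness score in [0, 1]"""
    root = Node([])
    best_dialogue, best_reward = [], -1.0
    for _ in range(iters):
        # 1. Selection: descend by UCB1 until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        # 2. Expansion: two dialogue entries per turn, so this caps
        #    the trajectory at `depth` attacker turns.
        if len(node.dialogue) // 2 < depth:
            for turn in expand_fn(goal, node.dialogue):
                d = node.dialogue + [("attacker", turn)]
                d = d + [("target", reply_fn(d))]
                node.children.append(Node(d, parent=node))
            if node.children:
                node = random.choice(node.children)
        # 3. Evaluation: score the dialogue for jailbreak success.
        reward = judge_fn(node.dialogue)
        if reward > best_reward:
            best_dialogue, best_reward = node.dialogue, reward
        # 4. Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return best_dialogue, best_reward
```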
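
MUSE-D is described as fine-grained alignment that intervenes early in dialogues using data generated during attacks. One plausible instantiation is turn-level preference optimization, sketched below: every target reply along a successful attack trajectory yields a preference pair, trained with a standard DPO loss. `safe_fn`, the pair format, and the choice of DPO are assumptions for illustration, not the paper's exact recipe.

```python
import torch.nn.functional as F

def build_turnwise_pairs(attack_dialogues, safe_fn):
    """Turn every target reply in a successful attack trajectory into a
    preference pair, so the training signal lands on the early turns where
    the jailbreak takes hold, not only on the final harmful reply.
    `safe_fn(context)` is an assumed helper that produces a safe reference
    reply (e.g., a guarded answer or refusal) for a dialogue prefix."""
    pairs = []
    for dialogue in attack_dialogues:
        for i, (role, text) in enumerate(dialogue):
            if role != "target":
                continue
            context = dialogue[:i]           # conversation up to this reply
            pairs.append({
                "context": context,
                "chosen": safe_fn(context),  # preferred: safe reply
                "rejected": text,            # dispreferred: reply seen in attack
            })
    return pairs

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over the turn-level pairs: raise the policy's
    likelihood of the safe reply relative to the unsafe one, anchored to a
    frozen reference model. Inputs are per-example log-prob tensors."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()
```

In this framing, the trajectories found by MUSE-A double as training data for MUSE-D, matching the paper's description of the defense being built from attack-generated data.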

💡 Why This Paper Matters

This paper provides innovative, systematic methodologies for identifying and mitigating vulnerabilities in large language models during multi-turn dialogues. By addressing both attack and defense mechanisms, it underscores the need for comprehensive approaches to AI safety, particularly as LLMs are deployed in increasingly sensitive applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it tackles model safety in multi-turn dialogues, where vulnerabilities can be exploited more effectively than in single-turn interactions. MUSE's systematic exploration of both attacks and defenses offers valuable insights and practical tools for building more secure AI systems.

📚 Read the Full Paper: https://arxiv.org/abs/2509.14651v1