
X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Authors: Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Published: 2025-09-10

arXiv ID: 2509.08729v1

Added to Library: 2025-09-11 04:00 UTC

Red Teaming

📄 Abstract

Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
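
As a rough illustration of the evolutionary loop the abstract describes, the sketch below shows how a fixed success threshold can act as selection pressure across template generations. It is a minimal sketch under stated assumptions, not the paper's implementation: the callables (`attack_once`, `judge_score`, `generate_variants`), the population size, and the mutation strategy are hypothetical placeholders; only the threshold (0.70) and the generation count (five) are taken from the abstract.

```python
def evolve(seed_templates, conversations, target_model, judge_model,
           generate_variants, attack_once, judge_score,
           generations=5, theta=0.70, keep=4):
    """Evolve M2S templates under a fixed success threshold.

    All callables (attack_once, judge_score, generate_variants) are assumed
    interfaces, not the paper's API; only theta=0.70 and generations=5 come
    from the abstract.
    """
    population = list(seed_templates)
    best = []
    for _ in range(generations):
        scored = []
        for template in population:
            hits = 0
            for conv in conversations:
                # Compress the multi-turn conversation into one structured
                # prompt; templates are assumed to contain a '{turns}' slot.
                prompt = template.format(turns="\n".join(conv))
                response = attack_once(target_model, prompt)
                # The judge returns a StrongREJECT-style score in [0, 1];
                # only scores at or above theta count as successes.
                if judge_score(judge_model, prompt, response) >= theta:
                    hits += 1
            scored.append((hits / len(conversations), template))
        # Keep the best-performing templates and let an LLM propose
        # structural variants for the next generation.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        best = scored[:keep]
        survivors = [tpl for _, tpl in best]
        population = survivors + generate_variants(survivors)
    return best
```

Counting only responses at or above the threshold mirrors the strict selection pressure described in the paper; every other detail is a placeholder for the components in the authors' repository.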

🔍 Key Points

  • Introduction of X-Teaming Evolutionary M2S framework for automated discovery of M2S templates.
  • Achieved a 44.8% overall success rate (103/230 on GPT-4.1) across five generations of template evolution under strict selection pressure (θ = 0.70).
  • Demonstrated a positive coupling between prompt length and judge score, motivating length-aware judging (see the sketch after this list).
  • Established the importance of threshold calibration in template evaluation, showing that stricter success criteria drive more meaningful structural improvements to templates.
  • Cross-model evaluation (a balanced panel of 2,500 trials with a fixed judge) showed that structural gains transfer but vary by target model, with two models scoring zero at the same threshold, underscoring the need for cross-model validation.
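
The length-score coupling noted above is what motivates length-aware judging. Below is one hypothetical way such a correction could look; the logarithmic penalty form, the reference length, and the constant are illustrative assumptions, not taken from the paper.

```python
import math

def length_adjusted_score(raw_score: float, prompt_len_tokens: int,
                          ref_len: int = 512, alpha: float = 0.1) -> float:
    """Dampen judge scores for prompts much longer than a reference length,
    so that verbosity alone cannot push a template over the success
    threshold. The penalty form and defaults are illustrative assumptions."""
    excess = max(0, prompt_len_tokens - ref_len)
    penalty = alpha * math.log1p(excess / ref_len)
    return max(0.0, raw_score - penalty)

# Example: a 2048-token prompt with a raw judge score of 0.74 drops to
# roughly 0.60 under the assumed defaults, falling below a 0.70 threshold.
print(length_adjusted_score(0.74, 2048))
```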

💡 Why This Paper Matters

This paper presents a significant advance in the automated generation and optimization of multi-turn-to-single-turn jailbreak templates, improving the reproducibility and efficiency of adversarial probing of language models. Its findings highlight the importance of calibrating evaluation thresholds and demonstrate a reproducible framework for iteratively improving prompt structures.

🎯 Why It's Interesting for AI Security Researchers

The research is particularly relevant to AI security researchers focused on understanding vulnerabilities in language models and developing defenses against misuse. Its combination of LLM-as-judge evaluation and evolutionary template search is a novel contribution that can enable more effective adversarial testing and support better-aligned security measures in AI systems.

📚 Read the Full Paper

https://arxiv.org/abs/2509.08729v1