
MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

Authors: Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Meng Han

Published: 2025-06-20

arXiv ID: 2506.16792v1

Added to Library: 2025-06-23 04:01 UTC

Red Teaming

📄 Abstract

Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks--methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version--order-determining optimization. Extensive experiments across two open-source models and four closed-source models demonstrate that MIST achieves competitive attack success rates and attack transferability compared with other state-of-the-art white-box and black-box jailbreak methods. Additionally, we conduct experiments on computational efficiency to validate the practical viability of MIST.
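The core loop the abstract describes, iteratively swapping words for synonyms while a similarity check guards the original intent, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: `query_llm`, `judge`, and `similarity` are hypothetical stand-ins for black-box access to the target model, a harmfulness scorer, and a semantic-similarity model, and WordNet is just one plausible synonym source.

```python
# Hedged sketch of a MIST-style sequential synonym search (not the authors' code).
# query_llm, judge, and similarity are hypothetical stand-ins supplied by the caller.
from nltk.corpus import wordnet  # requires `pip install nltk` plus the wordnet corpus


def synonyms(word: str) -> list[str]:
    """Collect WordNet synonyms for a word, excluding the word itself."""
    cands = {lemma.name().replace("_", " ")
             for syn in wordnet.synsets(word) for lemma in syn.lemmas()}
    cands.discard(word)
    return sorted(cands)


def sequential_synonym_search(prompt, query_llm, judge, similarity, sim_threshold=0.9):
    """Walk the prompt left to right; at each position, try synonyms that keep
    semantic similarity above the threshold, and keep any swap that raises the
    judge's score of the target model's response."""
    words = prompt.split()
    best_score = judge(query_llm(" ".join(words)))
    for i in range(len(words)):
        original = words[i]
        best_word = original
        for cand in synonyms(original):
            words[i] = cand
            candidate = " ".join(words)
            if similarity(prompt, candidate) < sim_threshold:
                continue  # reject swaps that drift from the original semantics
            score = judge(query_llm(candidate))
            if score > best_score:
                best_score, best_word = score, cand  # keep the improving swap
        words[i] = best_word  # restore the best word found for this position
    return " ".join(words), best_score
```

Because every position is searched greedily, the query count grows roughly with prompt length times synonyms per word; the order-determining refinement discussed below is aimed at spending that budget more selectively.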

🔍 Key Points

  • Introduction of MIST, an effective black-box jailbreak method for LLMs utilizing Iterative Semantic Tuning techniques.
  • Development of two complementary strategies, Sequential Synonym Search and its refinement, Order-Determining Optimization, which tune prompts while preserving semantic intent (a sketch of the ordering idea follows this list).
  • Comprehensive evaluation of MIST, demonstrating competitive attack success rates and robust performance relative to state-of-the-art white-box and black-box jailbreak methods across multiple LLMs.
  • Validation of practical computational efficiency, with significantly fewer queries than existing black-box jailbreak attacks.
  • Experiments revealing strong transferability of MIST-generated prompts across various closed-source and open-source LLMs.
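The ordering idea referenced above can be sketched as follows. How MIST actually ranks positions is specified in the paper; the deletion-based impact proxy here is an assumption chosen only to make the illustration concrete, and `similarity` is the same hypothetical semantic-similarity stand-in as in the previous sketch.

```python
# Hedged sketch of an order-determining step (the impact proxy is an assumption,
# not the paper's exact criterion): rank word positions by how much deleting the
# word shifts the prompt's meaning, so the limited query budget goes to
# high-impact positions first.
def determine_order(prompt: str, similarity) -> list[int]:
    """Return word indices sorted by descending semantic impact."""
    words = prompt.split()
    scored = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        impact = 1.0 - similarity(prompt, ablated)  # larger shift => higher impact
        scored.append((impact, i))
    return [i for _, i in sorted(scored, reverse=True)]
```

Iterating over `determine_order(prompt, similarity)` instead of left-to-right positions in the search above is one way such an ordering could reduce query counts, consistent with the efficiency results the paper reports.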

💡 Why This Paper Matters

This paper presents MIST, a novel method that effectively circumvents the defenses of large language models via Iterative Semantic Tuning. Its refined search strategies not only enhance the precision of jailbreak prompts but also improve query efficiency, showcasing its potential impact on the study of AI model vulnerabilities.

🎯 Why It's Interesting for AI Security Researchers

MIST is particularly relevant for AI security researchers as it sheds light on the vulnerabilities of language models in black-box settings. Its findings and methodologies could inform the development of more effective defenses against adversarial attacks and enhance the understanding of the security landscape surrounding large language models.

📚 Read the Full Paper: https://arxiv.org/abs/2506.16792v1