
Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

Authors: Wenjing Hong, Zhonghua Rong, Li Wang, Feng Chang, Jian Zhu, Ke Tang, Zexuan Zhu, Yew-Soon Ong

Published: 2026-03-20

arXiv ID: 2603.20122v1

Added to Library: 2026-03-23 03:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have been widely deployed, especially through free Web-based applications that expose them to diverse user-generated inputs, including those from long-tail distributions such as low-resource languages and encrypted private data. This open-ended exposure increases the risk of jailbreak attacks that undermine model safety alignment. While recent studies have shown that leveraging long-tail distributions can facilitate such jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic evaluation of these security and privacy vulnerabilities. In this work, we present EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search. EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that jointly maximizes attack effectiveness and minimizes output perplexity, and introduces a semantic-algorithmic solution representation to capture both high-level semantic intent and low-level structural transformations of encryption-decryption logic. Building upon this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive and semantically informed mutation and crossover for efficiently exploring a highly structured and open-ended search space. Extensive experiments demonstrate that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies, achieving performance competitive with existing methods at both the individual and ensemble levels.
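The abstract specifies the optimization objectives but not an implementation. As a rough illustration only, the Python sketch below shows what the two-objective fitness and Pareto filtering could look like; `Candidate`, `attack_score`, and `perplexity` are hypothetical placeholders assumed here, not EvoJail's actual interfaces, and the real framework's selection and operators are more involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for EvoJail's semantic-algorithmic representation:
# a high-level semantic intent paired with a low-level structural transform
# (e.g., an encoding of the payload). All names here are illustrative.
@dataclass
class Candidate:
    intent: str                      # high-level semantic intent
    transform: Callable[[str], str]  # structural encryption-decryption logic

def evaluate(cand: Candidate,
             attack_score: Callable[[str], float],
             perplexity: Callable[[str], float]) -> Tuple[float, float]:
    """Two objectives from the abstract: maximize attack effectiveness and
    minimize output perplexity. Perplexity is negated so both are maximized."""
    prompt = cand.transform(cand.intent)
    return (attack_score(prompt), -perplexity(prompt))

def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """Pareto dominance: a is at least as good everywhere, better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scored: List[Tuple[Candidate, Tuple[float, float]]]):
    """Keep the non-dominated candidates; these feed the next generation."""
    return [(c, f) for c, f in scored
            if not any(dominates(g, f) for _, g in scored if g is not f)]
```

In practice, `attack_score` would come from a judge model scoring refusal versus compliance and `perplexity` from a reference language model; the resulting Pareto front would then seed the next generation of candidates.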

🔍 Key Points

  • Introduction of EvoJail, an automated framework for generating long-tail distribution attacks on LLMs using multi-objective evolutionary search, eliminating reliance on handcrafted rules.
  • Formulation of long-tail attacks as a multi-objective optimization problem balancing attack effectiveness against output quality, built on a novel semantic-algorithmic solution representation.
  • Integration of LLM-assisted operators for adaptive mutation, crossover, and repair into the EvoJail framework, improving exploration of the highly structured prompt space (a toy sketch of such operators follows this list).
  • Extensive experiments showing EvoJail's competitive performance against existing methods in attack efficacy and output quality across various models and scenarios.
  • Demonstration of how EvoJail facilitates systematic evaluation of LLM vulnerabilities and underscores the need for automated tools in assessing AI safety.
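The key points name LLM-assisted mutation, crossover, and repair but not their prompts or scheduling. The toy loop below is one plausible wiring of such operators, a sketch under assumptions: the operator prompts, the `llm` completion function, and the scalar truncation selection are all illustrative inventions (the paper's selection is multi-objective, and its repair operator is omitted here).

```python
import random
from typing import Callable, List

# Illustrative LLM-assisted variation operators. `llm` is an assumed
# text-completion function (prompt in, completion out), not a real API.
def llm_mutate(llm: Callable[[str], str], cand: str) -> str:
    # Ask the LLM to perturb either the semantic framing or the encoding.
    return llm("Rewrite this attack template, changing its semantic framing "
               "or its encoding scheme while keeping it coherent:\n" + cand)

def llm_crossover(llm: Callable[[str], str], a: str, b: str) -> str:
    # Ask the LLM to recombine parents at the semantic-algorithmic level.
    return llm("Merge the semantic intent of template A with the "
               f"encoding structure of template B:\nA:\n{a}\nB:\n{b}")

def evolve(llm: Callable[[str], str],
           population: List[str],
           fitness: Callable[[str], float],
           generations: int = 10,
           p_mutate: float = 0.5) -> List[str]:
    """Toy (mu + mu) loop with scalar truncation selection; a faithful
    version would rank candidates by Pareto dominance over both objectives."""
    for _ in range(generations):
        offspring = []
        for _ in range(len(population)):
            if random.random() < p_mutate:
                offspring.append(llm_mutate(llm, random.choice(population)))
            else:
                a, b = random.sample(population, 2)
                offspring.append(llm_crossover(llm, a, b))
        pool = population + offspring
        population = sorted(pool, key=fitness, reverse=True)[:len(population)]
    return population
```

The appeal of LLM-assisted operators over random string edits, as the paper argues, is that variation stays semantically informed: offspring remain coherent prompts rather than degenerating into noise, which matters in a highly structured search space.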

💡 Why This Paper Matters

This paper is significant as it addresses the pressing issue of jailbreaking large language models by providing an innovative framework that automates the discovery of vulnerabilities through evolutionary algorithms. By eliminating manual rule crafting and enhancing exploration strategies, EvoJail represents a critical advancement in AI security methodologies and contributes to a more comprehensive understanding of LLM safety.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper highlights a novel approach to evaluating the robustness of large language models, an increasingly critical area as these models become more integrated into applications where safety and alignment are paramount. The techniques developed in this research could potentially inform the design of stronger defense mechanisms against jailbreak attacks, making it a valuable resource for professionals focused on enhancing AI reliability and security.
